INTRODUCTION 

TO 

MATHEMATICAL  STATISTICS 


BY 


CARL  J.  WEST,  Ph.D., 

ASSISTANT  PROFESSOR  OF  MATHEMATICS 
OHIO  STATE  UNIVERSITY 


COLUMBUS 

R.  G.  ADAMS  AND  COMPANY 
1918 


COPYRIGHT,  1918 

BY 
CARL  J.  WEST 


PRESS  OF 

THE  F.  J.  HEER  PRINTING  CO. 
COLUMBUS.  OHIO 


PREFACE. 

IT  is  the  aim  of  this  book  to  present  certain  topics  of 
elementary  statistical  theory  which  have  been  found  useful 
and  workable. 

The  statement  would  seem  warranted  that  no  more  than 
the  very  simplest  methods  should  be  used  by  one  who  has  no 
knowledge  of  the  principles  underlying  the  methods.  Busy 
though  the  scientist  may  be,  he  owes  it  to  the  science  and  to 
the  persons  who  may  accept  his  results  to  have  some  familiarity 
with  his  tools.  The  blind  application  of  formulas  in  statistics 
has been made possible by the convenient manuals that have appeared
and has been encouraged by the fact that the theory has
been  so  surrounded  by  intricate  and  involved  mathematics  that  it 
was  only  by  an  extended  research  that  a  knowledge  of  the  theory 
could  be  obtained. 

There  is  no  real  reason  why  the  theory  of  statistical  methods 
should  remain  in  obscurity.  The  necessary  mathematics  is 
largely  elementary  arithmetic  and  except  in  a  few  cases  there  is 
no  need  for  higher  mathematics.  This  book  presupposes  a 
reasonable  familiarity  with  elementary  mathematics  only. 

Because  of  the  desire  to  eliminate  higher  mathematics  from 
the body of the book the discussion of the theory of the Generalized
Frequency Curves of Pearson has been deferred to Appendix I.
For the same reason a discussion of the promising
method  of  variate  differences  is  omitted,  as  is  the  mathematical 
theory  of  random  selection. 

While  it  is  hoped  that  the  statistical  data  of  this  book  may 
be  of  interest  in  themselves  they  have  been  selected  solely  with 
reference  to  their  usefulness  in  illustrating  the  theory.  For  this 
reason  all  examples  and  exercises  have  to  do  with  very  simple 
data.  The  author  will  appreciate  notice  of  such  numerical  and 
other  inaccuracies  as  may  be  found. 

The  idea  is  emphasized  that  a  formula  or  method  to  be  of 
practical  and  trustworthy  value  to  a  statistician  must  be  so 
simple and direct that the final results can be interpreted in terms
of the original conditions or the given data. To illustrate, if the
arithmetic mean is ten per cent larger in one distribution than in
another what difference does this variation indicate in the forms
of the distributions or in the values of the two series of measurements?
If one correlation ratio is 0.54 and a second 0.59 how
much  more  closely  related  are  the  attributes  in  the  second  than 
in  the  first?  It  must  always  be  remembered  that  mathematics  is 
but  a  tool  to  be  used  when  the  desired  results  can  be  more 
efficiently  attained  by  its  use,  and  that  a  formula  is  nothing  more 
than a statement in mathematical language of a method of computation
already thought out and understood. The difficulties
that  may  arise  in  this  subject  are  not  primarily  mathematical. 
They  are  essentially  a  part  of  the  necessarily  difficult  task  of 
analyzing  a  statistical  distribution. 

The  preparation  of  a  book  on  mathematical  statistics  to 
appeal  to  scientific  workers  in  fields  ordinarily  considered  to 
be  non-mathematical  is  essentially  a  matter  of  experimentation. 
It  is  the  hope  of  the  author  that  this  book  may  stimulate  interest 
in  the  methods  of  presenting  statistical  theory  and  in  the  more 
inclusive  problem  of  making  mathematical  theory  more  widely 
available.  Any  suggestions  or  criticism  of  this  presentation  will 
be  appreciated. 

The  Bibliography  of  Appendix  II  is  inserted  as  a  guide  to 
advanced  reading  in  the  subject  of  mathematical  statistics;  the 
contributions  of  Prof.  Pearson  are  to  be  noted  especially. 

It seems hardly necessary to refer to the debt which anyone
who works in statistical theory must owe to Professor
Karl  Pearson.  Because  of  his  "Tables  for  Statisticians  and 
Biometricians"  the  formulas  of  Appendix  I  are  not  given  in 
more  detail. 

Professor  James  McMahon  has  given  most  generously  of 
his  time  and  interest.  Whatever  assistance  this  book  may  afford 
to  the  practical  worker  in  statistics  is  in  a  large  measure  due  to 
the  influence  of  Professor  Walter  F.  Willcox,  whose  critical 
insight  into  the  limitations  and  the  possibilities  of  statistical 
methods  together  with  the  originality  and  practical  initiative 
which  permeate  his  research  and  instructional  work  place  all 
his  students  under  obligations  to  him. 


CONTENTS. 

CHAPTER I.  PAGE

CURVE PLOTTING    7

Plotting  the  Data. 

General  Directions   for  the  Laying-off  of   Scales. 

Connecting  the  Plotted  Points. 

Directions  for  Plotting  Curves. 

The  Title  of  a  Diagram. 

More  than  one  Curve  on  the  Same  Diagram. 

Coordinates. 

Logarithmic  Curves. 

Cumulative  Curves. 

CHAPTER  II. 

CURVE   PLOTTING    (Continued) 16 

Interpolation. 

The  Smoothing  of  a  Curve. 

Smoothing  by  Inspection. 

The  Preservation  of  Areas. 

The  Adjusted  Data :    Interpolation. 

Test  of  a  Graduation. 

Determining  the  General  Trend  of  the  Data. 

Periodic  Data. 

CHAPTER  III. 
FREQUENCY CURVES   24

Definitions. 

The  Construction  of  a  Frequency  Distribution. 

Plotting a Frequency Distribution.

Smoothing  the  Frequency  Distribution. 

Use  of  the  Frequency  Curve. 

Errors  in  Representative  Data. 

CHAPTER  IV. 

AVERAGES    32 

The  Arithmetic  Mean. 

Statistical  Properties  of  the  Arithmetic  Mean. 
Theorem  on  the  Sum  of  Deviations  from  the  Mean. 
The  Weighted  Arithmetic  Mean. 

Adjustment  or  Graduation  Formulas. 

The  Geometric  Mean. 

Properties  of  the  Geometric  Mean. 

The  Median. 

Quartiles. 

Deciles. 

Statistical   Properties  of  the  Median. 

The  Probable  Deviation. 

The  Mode. 

Statistical   Properties  of  the   Mode. 


CHAPTER  V. 
THE  FORM  OF  THE  DISTRIBUTION 45 

Dispersion. 

Measures  of  Dispersion. 

Mean  Deviation. 

Proof  that  Mean  Deviation  Smallest  about  the  Mean. 

Statistical  Properties  of  the  Mean  Deviation. 

The  Mean  Squared  Deviation. 

Short  Rule  for  the  Mean  Squared  Deviation. 

The  Standard  Deviation. 

Properties  of  the  Standard  Deviation. 

The  Coefficient  of  Variability. 

The  Quartiles  as  Measures  of  Dispersion. 

Formula  for  the  Probable  Deviation. 

Probable   Deviation   of  the   Arithmetic   Mean. 

Probable  Deviation  of  the  Standard  Deviation. 

Statistical  Significance  of  the  Probable  Deviation. 

The  Deciles  as  Measures  of  Dispersion. 

Symmetrical  and  Asymmetrical  Distributions. 

The  Position  of  the  Averages  and  Asymmetry. 

Skewness. 

Measures  of  Skewness. 


CHAPTER  VI. 
THE NORMAL PROBABILITY CURVE 59

The  Equation  of  a  Frequency  Curve. 
Statistical  Theory  of  the  Normal  Curve. 
The  Equation  of  the  Normal  Curve. 
The  Graph  of  the  Normal  Equation. 
Areas  under  the  Normal  Curve. 
Preliminary  Determination  of  Normality. 
Probable  Deviation  in  a  Normal  Distribution. 


CHAPTER VII.

THE   CORRELATION    TABLE 67 

An  Illustration. 

The  Construction  of  a  Correlation  Table. 

Definitions  and   Symbols. 

Correlation. 

CHAPTER  VIII. 

THE  CORRELATION  RATIO 74 

The  Mean  as  Representative  of  the  Array. 

Regression  Curves. 

Coordinate  Axis. 

Correlation  and  Regression  Curves. 

Mean  Squared  Deviation  of  the  Means  of  the  Array. 

The  Correlation  Ratio. 

Two  Values  of  the  Correlation  Ratio  for  each  Table. 

Limiting  Values  of  the   Correlation  Table. 

Probable   Deviation  of  the   Correlation   Ratio. 

Spurious  Correlation. 

CHAPTER  IX. 

THE  COEFFICIENT  OF  CORRELATION 81 

Linear  Regression. 

The  Equations  of  the  Lines  of  Regression. 

The    Coefficient   of    Correlation. 

Computation  of  r. 

The Relation between r and η.

Limiting  Values  for  r. 

Statistical  Properties  of  the  Coefficient  of  Correlation. 

Test  for  Linearity  of  Regression. 

Probable  Deviations. 

CHAPTER  X. 

CORRELATION  FROM   RANKS 87 

Rank  in  a  Series. 
Theorems. 
Ties  in  Rank. 

The  Bracket  Rank  Method. 

The  Mid-Rank  Method. 

Probable  Deviation  of  the  Rank  Coefficient. 
Perfect  Rank  Correlation. 
Uncorrelated  Data. 
Correction Formula for the Rank Coefficient.
Corresponding  Values  of  ?*x    and  ^VKv- 



Probable Deviation of r_xy from Ranks.

Theorems. 

Accuracy of the Coefficient r_xy when Computed from Ranks.

CHAPTER XI.

THE  MOMENTS  OF  A  DISTRIBUTION 95 

Introduction. 

Transformation  Formulas. 

Summation  Methods. 

Correction  Formulas  for  the  Moments. 

Theorems. 

Summations. 

The  Moments  and  the  Equation  of  the  Smoothed  Curve. 

CHAPTER  XII. 

FURTHER  THEORY  OF   CORRELATION 108 

A  Second  Concept  of  Correlation. 

Derivation  of  the  Equations  of  the  Regression  Lines. 

The Relation between r and η.

The  Coefficient  r  for  non-linear  Regression. 

The  Most  Probable  Value  of  a  Characteristic. 

Theorems. 

Correlation  of  Indices. 

CHAPTER  XIII. 
THE  METHOD  OF  CONTINGENCY 119 

The  Mean  Squared  Contingency. 

Properties of φ.

Non-Quantitative   Characteristics. 

The  Four-fold  and  the  Nine-fold  Tables. 

Theorems. 

Appendix  I.     The  Frequency  Curves  of  Pearson 131 

Appendix  II.     Bibliography   145 


CHAPTER  I. 
CURVE  PLOTTING. 

Plotting  the   Data.     Let  us  plot  the   following  data*  of 
the  monthly  precipitation  at  Columbus  for  the  year  1916: 


January,    5.0 inches      May,      4.8 inches      September,  1.5 inches
February,   1.5 inches      June,     3.5 inches      October,    1.8 inches
March,      4.9 inches      July,     0.7 inches      November,   1.6 inches
April,      2.3 inches      August,   3.2 inches      December,   3.6 inches


A  horizontal  straight  line  is  first  drawn  and  at  equal  distances 
on  this  line  twelve  points  are  located,  one  for  each  month.  On  a 
vertical  line  erected  at  the  point  corresponding  to  the  month  of 
January equal intervals are laid off, one for each inch of
precipitation; and these intervals are subdivided into tenths. The
two series of points are called the scales. It is usual to designate
the horizontal and the vertical scale lines by O-X and
O-Y respectively, as in Figure I.



FIG.  I.     Monthly  Precipitation  at  the  Columbus  Station  for  the 
year   1916. 

The  January  precipitation  is  5.0  inches.  Place  a  dot  above 
the  January,  or  beginning  point,  at  a  height  corresponding  to  5.0 
inches  on  the  vertical  scale.  The  next  point  is  directly  above  the 
second  or  February  point  at  a  distance  corresponding  to  1.5 
inches. Continuing in this way we locate a point for each month;
the data is then said to be plotted or pictured point by point.
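
The point-by-point procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names plot_points and text_diagram are hypothetical, the horizontal positions are simply numbered 0 to 11 from January, and the character diagram rounds each height to the nearest half inch, which is far coarser than the tenths laid off on the scale of Figure I.

```python
# Monthly precipitation at Columbus, 1916, in inches (from the table above).
precipitation = [
    ("Jan", 5.0), ("Feb", 1.5), ("Mar", 4.9), ("Apr", 2.3),
    ("May", 4.8), ("Jun", 3.5), ("Jul", 0.7), ("Aug", 3.2),
    ("Sep", 1.5), ("Oct", 1.8), ("Nov", 1.6), ("Dec", 3.6),
]


def plot_points(data):
    """Return (x, y) pairs: x is the month's place on the horizontal
    scale (0 for January), y is the height in inches on the vertical scale."""
    return [(x, inches) for x, (_, inches) in enumerate(data)]


def text_diagram(data, step=0.5):
    """Draw a rough character diagram with one row per `step` inches.

    Each point is placed on the row nearest its height, so the picture
    is only as fine as `step` (an assumption made for this sketch).
    """
    top_row = round(max(inches for _, inches in data) / step)
    lines = []
    for row in range(top_row, -1, -1):
        cells = "".join(
            " *  " if round(inches / step) == row else "    "
            for _, inches in data
        )
        lines.append(f"{row * step:4.1f} |{cells}")
    lines.append("     +" + "----" * len(data))
    lines.append("      " + "".join(f"{name:<4}" for name, _ in data))
    return "\n".join(lines)


points = plot_points(precipitation)
diagram = text_diagram(precipitation)
print(diagram)
```

Each month contributes exactly one mark, directly above its place on the horizontal scale, which is all that "plotting point by point" requires.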

* Annual Meteorological Summary, U. S. Weather Bureau, Columbus,
Ohio. 1917.



Exercises. 

1.     Plot   the    following   March   precipitation    data.* 


Year  Inches      Year  Inches      Year  Inches      Year  Inches
1879   3.8        1889   0.7        1899   4.7        1909   2.7
1880   2.4        1890   5.6        1900   2.6        1910   0.3
1881   4.0        1891   4.6        1901   1.8        1911   2.4
1882   4.8        1892   2.2        1902   2.6        1912   4.6
1883   3.2        1893   1.9        1903   4.1        1913   8.1
1884   3.6        1894   1.8        1904   4.9        1914   2.5
1885   0.5        1895   1.2        1905   1.9        1915   1.2
1886   3.9        1896   3.0        1906   4.6        1916   4.9
1887   2.6        1897   5.5        1907   5.2
1888   3.8        1898   7.0        1908   6.0

2.     Plot  the  following  population  data  for  the  United  States : 


1790     3,929,214        1840    17,069,453        1890    62,947,714
1800     5,308,483        1850    23,191,876        1900    75,994,575
1810     7,239,881        1860    31,443,321        1910    91,872,266
1820     9,638,453        1870    38,558,371
1830    12,866,020        1880    50,155,783


In  plotting  this  data  take  the  numbers  to  the  nearest  million. 

General  Directions  for  the  Laying  off  of  Scales.  The 
object of any graphic representation of statistical data is to present
a vivid picture and therefore a diagram too small or too
large,  or  too  wide  or  too  narrow  will  not  accomplish  this  purpose 
as  efficiently  as  will  a  correctly  proportioned  diagram.  This 
means  that  the  widths  of  the  horizontal  and  the  vertical  scale 
intervals  must  be  carefully  chosen  in  order  to  give  the  complete 
diagram  the  proper  proportions. 

In  determining  the  widths  of  the  intervals  account  must  be 
taken  of  the  nature  of  the  statistical  material.  If  the  data  is  so 
inaccurate,  for  instance,  that  the  measurements  can  be  determined 
only  to  the  nearest  million  it  would  be  improper  to  divide  the 
scale  into  intervals  corresponding  to  thousands.  The  wealth  of 
the  country  and  the  value  of  manufactured  articles  are  examples 
of  statistics  which  do  not  admit  of  close  subdivision. 

*Annual Meteorological Summary, U. S. Weather Bureau, Columbus,
Ohio. 1917.

It is useless to have the scale intervals finer than the smallest
difference which the eye can conveniently distinguish on the diagram.
This often means, even in the case of quite accurate material,
that the figures of the data must be cut back; for instance,
in plotting population data for the United States one million
may be the smallest numerical difference that can be pictured
on an ordinary sized diagram.
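
Cutting the figures back to the smallest difference the diagram can show is a matter of rounding. The following Python sketch rounds the census figures of Exercise 2 to the nearest million; the helper name to_nearest is hypothetical and the choice of one million as the unit simply follows the text.

```python
# Census population of the United States (figures from Exercise 2).
census = {
    1790: 3_929_214, 1800: 5_308_483, 1810: 7_239_881, 1820: 9_638_453,
    1830: 12_866_020, 1840: 17_069_453, 1850: 23_191_876, 1860: 31_443_321,
    1870: 38_558_371, 1880: 50_155_783, 1890: 62_947_714, 1900: 75_994_575,
    1910: 91_872_266,
}


def to_nearest(value, unit):
    """Round a non-negative whole number to the nearest multiple of `unit`
    (halves are rounded upward)."""
    return (value + unit // 2) // unit * unit


# Population expressed in whole millions, ready for plotting.
millions = {
    year: to_nearest(count, 1_000_000) // 1_000_000
    for year, count in census.items()
}

for year, value in millions.items():
    print(year, value)
```

The rounded series runs from 4 million in 1790 to 92 million in 1910, so a vertical scale divided into single millions already shows every difference the data can support at this size.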

Usually, as in Figure II, horizontal and vertical lines, called
coordinate lines, are drawn to assist in carrying the divisions
of the scales across the diagram. Care must be taken that these
lines are lightly drawn and are not more numerous than is necessary.

Connecting  the  Points.  The  eye  is  assisted  in  passing 
across  a  diagram  if  the  plotted  points  are  connected  by  a  curve. 
The  curve  may  be  either  a  series  of  broken  straight  lines  joining 
the  points  or  a  continuous  curve  passing  thru  each  point  without 
sharp  angles  or  abrupt  changes  in  direction.  Of  the  two  methods 
the  continuous  curve  is  usually  to  be  preferred  because  of  the 
better  appearance  which  it  presents.  In  Figure  II  the  points  are 
connected  by  straight  lines  and  in  Figure  III  a  continuous  curve 
is  drawn. 

Exercises. 

3.  Plot  the  curve  of  the  1916  rainfall  at  Columbus  from  the  data 
of  Exercise  1. 


FIG. II. The Plotted Points of Monthly Temperatures connected
by straight lines.

FIG. III. The Points of Fig. II connected by a continuous curve.

4.     Plot  the  population  curve  from  the  data  of  Exercise  2. 




Directions  for  Plotting  Curves. 

1.  The    general    arrangement    of    a    diagram    should    be    from    left 
to  right  and  from  bottom  to  top. 

2.  Figures  for  the  scales  of  a  diagram  should  ordinarily  be  placed 
at  the  left  and  along  the  bottom. 

3.  Whenever   practicable,    the    vertical    scale    should   be   so    chosen 
that  the  zero  line  will  appear  on  the  diagram.     When  this  is  not  done 
it  is  well  to  indicate  that  fact  by  a  break  in  the  diagram. 

4.  The   zero   lines    must   be    sharply    distinguished    from   the   other 
coordinate  lines  of  the  diagram. 

5.  The  curve  must  be  carefully  distinguished   from  the  coordinate 
lines. 

6.  The    data    should    accompany    the    diagram    either    in   the    form 
of  a  tabular  statement  or  placed   directly  on  the  diagram.     The  latter 
method  of  presenting  the  original  data  can  sometimes  be  effectively  used, 
especially  when  the  number  of  items  is  not  large. 

Underlying all rules for the construction of statistical diagrams
is the general direction: The diagram must be so arranged
as to present the data most effectively. Because of the
great diversity of statistical material and of the wide variety of
purposes for which data may be collected and presented it is
not possible to lay down specific rules which are to be followed
in every case. Whenever the vividness and accuracy of the statistical
picture is not sacrificed by so doing, the conventional and
generally accepted ways should be followed.

Exercises. 

5.     Plot  the   following  data  of  annual  precipitation.* 


Year  Inches      Year  Inches      Year  Inches      Year  Inches
1879  31.?        1889  28.5        1899  28.5        1909  36.6
1880  44.7        1890  50.7        1900  30.3        1910  34.8
1881  47.0        1891  42.1        1901  26.5        1911  43.4
1882  51.3        1892  33.5        1902  34.2        1912  29.6
1883  48.9        1893  38.1        1903  28.1        1913  40.9
1884  31.0        1894  29.5        1904  31.5        1914  31.2
1885  43.3        1895  30.7        1905  35.1        1915  39.9
1886  42.4        1896  40.5        1906  33.7        1916  34.4
1887  30.3        1897  41.2        1907  37.6
1888  35.1        1898  41.3        1908  30.1

Since  the  lowest  number   of   inches   is   26.5   it   is  better  to  make  a 
break  in  the  vertical  scale,  starting  the  working  scale  with,  say,  25  inches. 


* Report of Columbus Station, U. S. Weather Bureau.


6.     Plot  the  following  data  of  mean  monthly  temperatures.* 




Year  Degrees     Year  Degrees     Year  Degrees     Year  Degrees
1879   53.1       1889   52.2       1899   53.2       1909   52.0
1880   53.6       1890   53.2       1900   53.8       1910   51.7
1881   54.2       1891   52.6       1901   51.8       1911   53.8
1882   53.4       1892   51.3       1902   52.1       1912   50.8
1883   51.8       1893   51.2       1903   52.0       1913   53.5
1884   52.5       1894   53.3       1904   50.2       1914   52.0
1885   49.1       1895   51.6       1905   51.5       1915   51.8
1886   50.3       1896   53.0       1906   52.7       1916   52.0
1887   52.5       1897   52.0       1907   50.8
1888   51.0       1898   53.6       1908   53.5

7. Plot the curve of Top Beef Cattle Prices from the following
data:**


Year   Price      Year   Price      Year   Price      Year   Price
1891    7.15      1898    6.25      1905    7.00      1912   11.25
1892    7.00      1899    8.25      1906    7.60      1913   10.25
1893    6.75      1900    7.50      1907    8.00      1914   11.40
1894    6.40      1901    8.00      1908    8.40      1915   11.60
1895    6.60      1902    9.00      1909    9.50      1916   13.00
1896    6.50      1903    6.85      1910    8.85
1897    6.00      1904    7.65      1911    9.35

8. From the data of page 25 plot the 1916 beef cattle prices.
9. From the data of page 25 plot the 1895 beef cattle prices.

The Title of a Diagram. Each diagram must be provided
with a brief and concise and yet accurate and comprehensive
title. The title must cover all of the data and not merely a
certain section of it and it must do this without being of undue
length. A careful study of examples of titles is especially helpful
in acquiring a notion of what constitutes a proper title.

All headings of columns must be clear and definite. The
units of measurement of a scale must always be given; thus,
"Precipitation in inches," "Temperature in degrees."

Titles and headings have a better appearance when made
in Roman characters than when made in script. In general the
size of type in each heading or sub-title should correspond in size
and prominence to its respective importance. Unless the lettering
is skilfully done by hand it is better to use a typewriter even
tho different sizes of letters cannot be secured by its use.


*  Report  of  Columbus  Station,  U.  S.  Weather  Bureau. 
** Chicago Live Stock World, January 2, 1917.




Exercises. 

10.  Study  the  titles   and   headings   of   the   diagrams   and   tables   of 
Vol.  V,  Report  of  the  United  States  Census,  1910. 

11. Study the titles shown in "Graphic Methods for Presenting
Facts" by Willard C. Brinton.†

12.  Study    the    titles    and    headings    of    the    current    issue    of    the 
Monthly   Crop   Reporter,   Department  of  Agriculture. 

In each of the following exercises construct a complete statistical
diagram with the curve carefully drawn and an appropriate title
designed for each.

13.  The    Land    area  of    the    United    States    exclusive    of  Outlying 
possessions  from  Table  18,  Vol.  I,  Report  of  the  United  States  Census, 
1910. 

14.  The  population  of  Ohio  from  Table  10,  same  report. 

15. Comparative Values of Inside Lots of Different Depths
according to the Lindsay-Bernard system of valuation.


The Lindsay-Bernard and Somers Valuation Schedule.*

        Lindsay-                            Lindsay-
Depth.  Bernard.   Somers.          Depth.  Bernard.   Somers.
   5     $9         $14.35            85     $82.0      $93.33
  10     15          25.00            90      84.2       95.60
  15     21          32.22            95      86.2       97.85
  20     27          41.00           100      88        100.00
  25     33          47.90           105      89.6      102.08
  30     38.5        54.00           110      91.1      104.00
  35     44          59.20           115      92.5      105.78
  40     49          64.00           120      93.8      107.50
  45     54          68.45           125      95        109.50
  50     58.5        72.50           130      96.1      110.50
  55     63          76.20           135      97.2      111.80
  60     67          79.50           140      98.2      113.00
  65     70.6        82.61           145      99.2      114.50
  70     73.9        85.60           150     100        115.00
  75     76.9        88.30           175     103        119.14
  80     79.6        90.90           200     105        122.00

16. Comparative Values of Inside Lots of Different Depths
according to the Somers system.

17. The accumulated value of $1 at 10% compound interest:

Year.      1     2     3     4     5     6     7     8     9     10
Amount.    1.00  1.10  1.21  1.33  1.46  1.61  1.77  1.95  2.14  2.36


†The Engineering Magazine, 1915, N. Y.
*The  National   Real   Estate  Journal,   May,   1914. 




18.  The  Average  Yield  per  Acre  for  Wheat  in  the  United  States 
since  1866;  Yearbook,  Dep't  of  Agriculture. 

19.  Average    Farm    Price    per    bushel    of    Wheat    in    the    United 
States  since  1866;  Yearbook,  Dep't  of  Agriculture. 

20.  Per  cent  of  Wheat  Crop  Exported  since  1866 ;  Yearbook,  Dep't 
of  Agriculture. 

21.  Total   Production   of   Wheat  to   nearest   10   million  bushels   in 
the  United  States  since  1866 ;  Yearbook,  Dep't  of  Agriculture. 

22. Substitute the word Corn for Wheat in Exercises 18 to 21 and
construct the curves.

23.  Bank  Clearings  of  the  United  States,  excluding  N.  Y. 

Bank  Clearings  of  U.  S.  excluding  N.  Y.   (in  millions). 


1883   $14,209        1894   $21,298        1905   $50,087
1884    12,919        1895    23,507        1906    55,327
1885    13,170        1896    22,304        1907    57,994
1886    15,513        1897    23,895        1908    53,133
1887    17,566        1898    26,959        1909    62,249
1888    18,397        1899    33,416        1910    66,821
1889    20,280        1900    33,771        1911    67,857
1890    23,370        1901    39,152        1912    73,209
1891    23,198        1902    41,695        1913    75,181
1892    25,660        1903    43,239        1914    72,225
1893    23,049        1904    43,972


24. Per capita Imports of U. S.:

1860   $11.25        1879   $10.52        1897   $10.32
1861     9.02        1880    13.88        1898     8.66
1862     5.79        1881    13.06        1899    10.68
1863     7.29        1882    14.36        1900    10.86
1864     9.30        1883    12.81        1901    11.34
1865     6.87        1884    11.48        1902    12.30
1866    12.26        1885    10.49        1903    12.42
1867    10.23        1886    11.57        1904    12.71
1868     9.94        1887    12.09        1905    14.24
1869    11.60        1888    12.11        1906    15.69
1870    11.97        1889    12.58        1907    16.29
1871    14.47        1890    13.15        1908    12.54
1872    16.15        1891    12.96        1909    16.28
1873    14.27        1892    12.91        1910    10.94
1874    13.13        1893    11.68        1911    16.32
1875    11.43        1894     9.97        1912    19.04
1876     9.47        1895    11.60        1913    18.47
1877    10.37        1896     9.66        1914    18.14
1878     9.07

Note   that   the   data   of   the   two   preceding   exercises   shows   a   de- 
cided  periodicity  or  wave-like  nature. 



More  than  One  Curve  on  the  Same  Diagram.  For  the 
purpose  of  comparing  different  curves  it  is  often  convenient 
to  plot  two  or  more  curves  on  the  same  diagram.  For  instance, 
simultaneous  variations  in  the  prices  of  wheat  and  corn  can  be 
observed  to  good  advantage,  when  the  two  curves  are  brot 
together  on  the  same  diagram  and  constructed  to  the  same  scales. 
The  chief  disadvantage  of  this  method  of  comparing  curves  lies 
in  the  resulting  complexity  of  the  diagram.  If  the  diagrams 
are  constructed  on  thin  paper  and  the  lettering  and  curves  are 
made  heavy  the  different  curves  when  made  on  separate  sheets 
can  be  readily  compared  by  adjusting  one  sheet  of  paper  above 
the  other. 

Exercises. 

25. Compare the rainfall curve of Exercise 5 with the temperature
curve of Exercise 6. To what extent do the two curves vary in the
same directions? What conclusions can be drawn as to the tendency
for the amount of rainfall to depend on the temperature?

26. Compare the two systems of real estate valuation of Exercises
15 and 16.

27.  Give  a  comparative  interpretation  of  the  curves  of  Exercises 
18   and    19.     Why   should   they   not   be    expected   to   follow   exactly  the 
same  general  course? 

28.  Discuss  as  in  Exercise  27  curves  of  prices  and  yield  per  acre 
of  corn. 

29.  Compare  the  curves  of  Exercises  21  and  23. 

Coordinates. It is convenient to have a standardized
notation for the horizontal and vertical scales. The horizontal
line is denoted by O-X and called the axis of abscissas
or simply the X-axis. The vertical line is denoted by O-Y
and called the axis of ordinates or the Y-axis. The point
where the two lines meet is the origin of coordinates. Distances
along the X-axis are spoken of as x distances or x
coordinates, and those along the Y-axis as y distances, or
y coordinates. Thus in the precipitation data of page 7, the
origin is at January, 1916, and the values of X differ by intervals
of one month, while the unit interval for Y is one inch.

Logarithmic Curves. Whenever the data seems to exhibit
a uniform rate of increase or whenever it is desired to
study the relative changes rather than the actual changes in the
data, a logarithmic curve may be of service.* A logarithmic
curve is obtained by taking the logarithms of the measurements
and using these logarithms as vertical distances or ordinates.
Since multiplying two numbers adds their logarithms, a constant
ratio or rate will appear in the logarithmic diagram as a constant
addition. Hence if there is a constant rate in the data
the logarithmic curve will be a straight line. Whether the rate
is constant or not, curves of this type are of value for comparing
different rates. However, if the rate is not approximately
constant considerable familiarity with logarithms is necessary
if the comparative differences are to be correctly interpreted.
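
The statement that a constant rate becomes a constant addition can be checked numerically. The Python sketch below uses the compound interest amounts of Exercise 17; the variable names are illustrative only, and common (base 10) logarithms are an assumption, since any base gives a straight line.

```python
import math

# Accumulated value of $1 at 10% compound interest (Exercise 17).
amounts = [1.00, 1.10, 1.21, 1.33, 1.46, 1.61, 1.77, 1.95, 2.14, 2.36]

# Ordinates of the logarithmic curve.
log_ordinates = [math.log10(a) for a in amounts]

# Year-to-year rise of the ordinary curve and of the logarithmic curve.
actual_steps = [b - a for a, b in zip(amounts, amounts[1:])]
log_steps = [b - a for a, b in zip(log_ordinates, log_ordinates[1:])]

for year, (rise, log_rise) in enumerate(zip(actual_steps, log_steps), start=2):
    print(f"year {year:2d}: rise {rise:.2f}, rise in logarithm {log_rise:.4f}")
```

The actual yearly rise grows from 0.10 to 0.22, while the rise in the logarithm stays close to log 1.10, about 0.0414, in every year; the small departures come from the two-place rounding of the tabulated amounts.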

Exercises. 

30.  Plot  the  logarithmic  curve  of  the  data  of  Exercise  17. 

31.  Plot  the  logarithmic  curves  of  the  data  of  Exercises  15  and  16. 

32.  Plot   the   logarithmic    curve    of    the    Chicago    Top    Beef    Cattle 
Prices. 

Cumulative  Curves.  All  the  preceding  curves  show  the 
respective  values  for  each  interval  of  the  horizontal  axis,  as 
the  production  of  wheat  for  each  year  since  1866  is  shown  by 
the  curve  of  Exercise  21.  Now  if  it  is  desired  to  construct  a 
curve  exhibiting  at  each  year  the  total  production  of  wheat  since 
1866,  the  amount  of  each  year's  production  is  added  to  that 
of  all  the  preceding  years  and  the  resultant  cumulative  sums 
plotted.  In  this  way  a  curve  is  obtained  which  starts  at  the 
lower  left  hand  corner  and  proceeds  in  a  diagonal  direction 
across  the  diagram.  It  is  called  a  cumulative  curve.  The 
values  to  be  plotted  will  be,  in  the  case  of  the  cumulative  curve 
of  wheat  production,  150,000,000,  360,000,000,  580,000,000, 
840,000,000,  etc. 
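
The running totals can be formed mechanically, as in the following Python sketch. The yearly figures used here are not taken from the Yearbook: they are merely the differences of the four cumulative values quoted above (150, 360, 580 and 840 million bushels), and should be read as an assumption made for illustration.

```python
from itertools import accumulate

# Yearly production in millions of bushels, backed out from the
# cumulative values quoted in the text (an assumption, not source data).
yearly = [150, 210, 220, 260]

# Each cumulative value is the year's production added to all preceding years.
cumulative = list(accumulate(yearly))

for year, total in zip(range(1866, 1866 + len(yearly)), cumulative):
    print(year, total)
```

Because every yearly amount is positive, each cumulative value exceeds the one before it, which is why the curve climbs steadily from the lower left hand corner of the diagram.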

Exercises. 

33.  Plot  the  cumulative  curve  of  wheat  production. 

34.  Plot  the  cumulative  curve  of  corn  production  and  compare  with 
the  curve  of  Exercise  33. 

35.  Of  what  significance  is  the  slope  of  a  cumulative  curve? 


*  See  "The  Ratio  Curve,"  Fisher.     Quarterly  Publications  American 
Statistical  Association,  June,  1917. 


CHAPTER  II. 
CURVE  PLOTTING— (Continued.) 

Interpolation.  The  curves  of  the  preceding  chapter  were 
drawn  for  the  purpose  of  connecting  the  plotted  points  in  order 
to  assist  the  eye  in  following  the  course  of  the  data  across  the 
diagram.  However,  other  uses  can  be  made  of  a  statistical 
curve. 

At the beginning of Chapter I the data of monthly precipitation
is given. What was the weekly precipitation? The
Chicago  Top  Beef  Cattle  monthly  prices  are  given  under 
Exercise  7,  Chapter  I.  What  were  the  weekly  prices  during 
the  period  covered  by  that  data?  The  population  of  the 
United  States  is  given  for  ten-year  intervals.  What  has  been 
the  population  from  year  to  year?  These  are  essentially 
questions  of  interpolation,  that  is,  of  estimating  values  lying 
between  the  given  values. 

The  method  of  obtaining  intermediate  values  from  the 
curve  consists  merely  of  measuring  on  the  vertical  scale  the  height 
of  the  curve  at  the  required  point.  Thus  with  the  population 
curve  of  Exercise  4,  Chapter  I,  which  is  constructed  from  the 
decennial  census  reports,  the  population  for  the  year  1906  is  given 
by  the  height  of  the  curve  above  the  1906  point  on  the  horizontal 
scale. 
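
When the census points are joined by straight lines, reading the height of the curve amounts to the following arithmetic. The sketch is hedged: the two population figures are rounded values in millions supplied here for illustration, not the table of Exercise 4, and `read_height` is a hypothetical helper name.

```python
def read_height(x, x0, y0, x1, y1):
    """Height at x of the straight segment joining (x0, y0) and (x1, y1)."""
    fraction = (x - x0) / (x1 - x0)
    return y0 + fraction * (y1 - y0)

# Rounded census populations in millions (illustrative assumption).
pop_1900, pop_1910 = 76.2, 92.2

# 1906 lies six-tenths of the way from 1900 to 1910.
estimate_1906 = read_height(1906, 1900, pop_1900, 1910, pop_1910)
print(round(estimate_1906, 1))
```

The estimate depends on nothing but the two census values which inclose the year 1906.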

Exercises. 

1. Estimate the Top Beef Cattle Prices for each week in February,
1916, from the monthly data of Exercise 7, Chapter I.

2.  Estimate  the  values  of  inside  lots  for  the  fraction  of  a  foot,  say 
67.5  feet,  from  the  data  of  Exercise  16  of  the  preceding  chapter. 

3.  What  is,  according  to  the  data  of  Exercise  17  of  Chapter  I,  the 
compound  amount  of  $1  for  7.5  years  at  10%? 

This  method  of  interpolating  makes  an  estimated  value 
depend  on  the  two  consecutive  given  values  which  inclose  it. 
But the increase in population during a decade may have
occurred almost entirely during the last years of the period and
yet the shape of the curve when drawn merely to connect the
ten-year  points  may  give  no  hint  of  this  irregularity  of  increase. 
The  temperature  for  one  month  may  have  no  connection  with 
that  of  the  preceding  month  and  hence  the  curve  between  the 
points, depending as it does on the two non-related values, can
hardly be expected to give the actual temperature for an intermediate
week or day. If the price of wheat for the year 1905
is  omitted  can  it  be  reliably  estimated  by  drawing  the  curve 
from  the  years  1904  and  1906  and  then  interpolating  for  the 
missing  year? 

It  must  be  apparent  therefore  that  a  curve  which  passes 
thru  a  series  of  more  or  less  non-related  points  can  be  of  little 
value  in  interpolation  and  that  the  problem  of  interpolation  is 
essentially  one  of  determining  by  some  means  or  other  the 
general  course  of  the  data  and  then  estimating  the  intermediate 
values in conformity with this general trend. The values obtained
in this way are the most probable values; accidental variations
which bear no relation to the underlying tendencies can not
be so estimated; in fact such variations can not be estimated
or predicted by any means.

The  Smoothing  of  a  Curve.  The  curves  of  Chapter  I, 
drawn  as  they  are  thru  each  point,  preserve  all  the  variations 
whether  they  are  fundamentally  essential  or  due  merely  to  the 
presence  of  accidental  influences.  The  curve  of  mean  monthly 
temperatures, Exercise 6 of the preceding chapter, shows distinct
seasonal variations in temperature: higher temperatures in
summer and lower in winter. Along with these essentially
significant changes are fluctuations apparently accidental; in
one year June is warm and in another relatively cool; sometimes
January is warmer than February and sometimes the
reverse  is  true. 

To  represent  a  general  movement  or  trend  the  curve 
must  be  drawn  without  abrupt  changes  in  direction  and  must 
sweep  among  the  points  rather  than  necessarily  thru  each 
point.  Since  such  a  smoothed  curve,  as  it  is  called,  depends 
on the general or collective characteristics of the data the drawing
of it must be based on collective properties of the measurements.
One pertinent general property has just been stated;
namely, that the curve must be smooth, that is, not have abrupt
changes in direction. This property expresses the statistical
assumption  that  the  significant  variations  are  fairly  uniform 
from  value  to  value  and  not  capricious  or  arbitrary.  A  second 
assumption,  which  is  presently  discussed,  is  that  certain  areas 
are  relatively  stable  and  unchanging. 

Smoothing  by  Inspection.  The  smoothing  of  a  curve  may 
be  based  on  a  study  of  the  data  and  made  a  matter  of  the  skill  and 
experience  of  the  statistician  without  the  assistance  of  definitely 
stated  assumptions  or  properties.  The  curve  is  then  said  to  be 
smoothed  by  inspection. 

In  smoothing  a  curve  the  first  step  is  to  study  the  data 
carefully.  Without  such  an  investigation  into  the  probable 
sources and extent of the irregularities and fluctuations one
cannot hope to know what irregularities to smooth out and what
to leave in. A curve cannot be reliably smoothed by a statistician
who does not know the data thoroly. On the basis of
the information gained by this study a preliminary curve should
then be drawn freehand among the points. By successive
erasures and redrawings the finished curve can gradually be
arrived at. Thus a curve showing the long time movements in
the price of wheat will pass above some points and below others,
and how much the curve should miss any point can not be determined
without a knowledge of financial conditions, yields, etc.

The inspection method of smoothing a curve is often sufficiently
accurate for all practical purposes, especially when
done by a statistician of experience and especially when there
is a considerable element of inaccuracy inherent in the data.
Its disadvantage lies obviously in the fact that no two smoothings
of the same curve will be exactly alike; the method is essentially
tentative and personal.

In any event a rough preliminary draft of the curve should
be made by inspection before proceeding to apply more refined
methods.

Exercises. 

4.  Smooth  the  illustrative  data  at  the  beginning  of  Chapter  I. 

5.  Smooth  the  data  of  the  population  of  the  United  States  as  given 
in  Exercise   II,   Chapter   I. 

6. Smooth the data of annual rainfall of Exercise 5, Chapter I.

7.  Smooth  the  data  of  Exercises  18,  19,  20  and  21  of  Chapter  I. 


FIG. IV. The Smooth Curve of Monthly Precipitation
at Columbus, 1916.

The Preservation of Areas. In the illustrative data at the
beginning of Chapter I the precipitation of 4.9 inches in March
is the total precipitation for the whole month. With a base of one
unit, a rectangle of height 4.9 will have an area equal to the
total  precipitation.  Likewise  the  rectangle  on  the  July  unit  as  a 
base  will  have  an  area  equal  to  0.7,  which  is  the  July  precipitation. 
The  prices  of  Exercise  7,  Chapter  I,  can  in  a  similar  manner  be 
represented  by  rectangles  with  heights  equal  to  the  respective 
prices  and  with  unit  bases.  The  population  data  of  Exercise  2 
of  the  same  chapter  may  be  represented  by  rectangles  which  are 
not  adjacent  and  have  nine  rectangles  omitted  between  successive 
census  years. 

After the curve is smoothed each rectangle will be
altered so as to have a curved top. The total area under the
finished curve will then be the sum of the areas of the modified
rectangles. The First Rule of Preservation of Areas is
that the curve should be so smoothed that the total area
under the resulting curve is equal to the sum of the areas of the
original rectangles. Since, for instance, the monthly precipitation
is made up of the sum of the daily precipitations it is likewise
reasonable to assume that the monthly sum is more
stable than is the daily or weekly and hence we have the
Second Rule of the Preservation of Areas; namely, where
possible, the areas of the individual rectangles are to remain
unchanged. This can be done by adding to and subtracting from
each  rectangle  an  equal  sum. 
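
One simple reading of "adding to and subtracting from each rectangle an equal sum" is to tilt the flat top of a rectangle, lowering one side by as much as the other is raised. The sketch below assumes a straight sloping top, which is only the simplest case of a curved top; the function name and the tilts tried are hypothetical.

```python
def area_with_sloped_top(height, d, base=1.0):
    """Area of a unit-base figure whose flat top at `height` is replaced by
    a straight top running from height - d at the left to height + d at
    the right (the area of a trapezoid)."""
    left, right = height - d, height + d
    return base * (left + right) / 2.0

march = 4.9  # inches; the March rectangle of the text

# Whatever the tilt d, what is cut from one side is added to the other.
areas = [area_with_sloped_top(march, d) for d in (0.0, 0.5, 1.2)]
print(areas)
```

Each area agrees with the original 4.9 up to rounding in the arithmetic, so the March total is preserved while the top of the rectangle is free to follow the neighboring months.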

Within  the  requirement  that  the  curve  must  be  free  from 
abrupt  changes  in  direction  the  two  preceding  working  rules 
furnish a fairly comprehensive basis for the smoothing of
statistical data. In later chapters more detailed rules will be discussed
and applied. However, for most data the present rules
are  sufficient. 

As  explained  for  the  precipitation  data  a  definite  statistical 
meaning  can  usually  be  found  for  the  rectangles.  Even  when  a 
significance  is  with  difficulty  ascribed  to  the  rectangles  they  should 
be  drawn  and  the  same  rules  applied  to  the  smoothing  as  before. 
The method is in such cases justified wholly by its practical convenience.

In  the  illustrative  plotting,  at  the  beginning  of  Chapter  I, 
of  the  data  of  monthly  precipitation  at  Columbus  for  the  year 
1916,  the  vertical  scale  was  laid  off  on  a  line  thru  the  January 
point. In constructing the rectangles for smoothing, it is convenient
to have the January and other perpendiculars at the
middle  of  the  respective  intervals  in  order  that  there  may  be  a 
half  unit's  space  at  the  left  of  the  beginning  point.  The  zero 
point  on  the  horizontal  scale  is  then  at  the  beginning  of  the  first 
interval  and  the  vertical  distance  for  the  first  point  is  taken  not 
on  the  vertical  scale  line  but  perpendicularly  above  the  mid-point 
of  the  interval.  Whenever  the  curve  is  to  be  smoothed  the  scale 
is  marked  off  in  this  way;  ordinarily  the  method  of  Chapter  I  is 
employed  where  the  curve  is  not  to  be  smoothed. 
The following diagram illustrates the application of the
rectangle method of smoothing to the monthly precipitation data.


FIG.     V.     The     Rectangle     Method     of     Smoothing     the 
Monthly  Precipitation  data  for  Columbus  in  1916. 

Exercises. 

8. Construct the smoothed curve of prices from the data of Exercise
7, Chapter I.

9.  Do  the  same  for  the  data  of  Exercise  5  and  of  Exercise  6,  of 
the  same  chapter. 



10.  Do  the  same  for  the  data  of  Exercises  18,  19,  20  and  21  of  the 
same  chapter. 

11. Do the same for the data of Exercise 22 of the same chapter.

12.  Can  the  rules  of  permanence  of  areas  be  applied  effectively  to 
the  drawing  of  the  curve  for  the  data  of  Exercise  17  of  the  preceding 
chapter?      Why?      To    the    data    of    Exercises    15    and    16    of    the    same 
chapter? 

13. In drawing the smooth curve of decennial census population it is
advisable to alter the original data very slightly, if at all. Discuss.

A common way of drawing this curve is to connect the ten year points
by a series of straight lines and then round out the angles where the
lines intersect. This assumes a uniform annual increase during the decade,
an assumption which may or may not be true.

14.  The  statistical  significance  of  the  rectangles  has  been  discussed 
for  the  precipitation  data.    Develop  the  corresponding  explanation  for  the 
decennial  census  data. 

15.  Show  that  in  the  data  of  Exercises  15,  16  and  17  of  the  preceding 
chapter   the   rectangles   are   not  significant. 

The  Adjusted  Data;  Interpolation.  Since  in  general  it 
is  impossible  to  preserve  exactly  the  area  of  each  rectangle 
the  process  of  smoothing  will  lead  to  values  differing  from 
those  of  the  original  data.  Consequently,  the  data  is  said  to 
be  adjusted  or  graduated  or  smoothed  by  means  of  the  curve. 
In  accordance  with  the  reasoning  at  the  beginning  of  this 
chapter  the  adjusted  values  are  to  be  taken  as  giving  a  more 
significant  idea  of  the  true  trend  of  the  data  than  does  the 
original  data.  It  is  evident  that  we  have  here  the  solution 
to the problem of interpolation. Therefore, the rule for
interpolation is: to obtain the value at any point on the horizontal
scale measure the corresponding ordinate of the smoothed
curve, or measure the proper area under that curve. Thus the
rainfall during the first week in June is obtained by measuring
the  area  under  the  curve  on  the  first  one-fourth  of  the  June  base 
unit. 
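
The measuring of an area under the curve can be imitated numerically. In the sketch below the smoothed curve across the June unit is a hypothetical straight line falling from 3.0 to 2.0; it is not the curve of Fig. IV, and the helper `area_under`, which applies the midpoint rule, is an assumed stand-in for measurement on the diagram.

```python
def area_under(curve, start, stop, steps=1000):
    """Approximate the area under `curve` from start to stop by summing
    thin rectangles taken at their mid-points."""
    width = (stop - start) / steps
    return sum(curve(start + (i + 0.5) * width) for i in range(steps)) * width

# Hypothetical smoothed curve across the June base unit; t runs from 0 at
# the beginning of the month to 1 at its end.
def june_curve(t):
    return 3.0 - 1.0 * t

first_week = area_under(june_curve, 0.0, 0.25)   # first one-fourth of the base
whole_month = area_under(june_curve, 0.0, 1.0)   # the entire June unit
print(round(first_week, 3), round(whole_month, 3))
```

The first week receives more than one-fourth of the monthly total here because the assumed curve stands highest at the beginning of the month.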

Test  of  a  Graduation.  The  extent  to  which  smoothing 
preserves  the  areas  of  the  individual  rectangles  is  often  taken 
as a test of the appropriateness of the smoothing or graduation.
The smoothed curve is said to fit the data and the
term  "goodness  of  fit"  is  used  to  denote  the  appropriateness 
of  the  methods  used  in  the  process  of  constructing  the  smooth 
curve.  The  goodness  of  fit  is  then  measured  by  the  extent  to 
which the areas of the individual rectangles are preserved. In
applying this test two columns of numbers are set down, in one
the  original  values  and  in  the  other  the  adjusted  values.  The 
differences  are  then  taken  and  studied.  Other  conditions  being 
equal  the  smoothing  with  the  smallest  differences  is  the  best,  tho 
the  judging  of  goodness  of  fit  is  largely  a  matter  of  experience. 
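
The two columns and their differences may be set down as follows. Both columns are hypothetical values chosen for illustration, and the sum of the absolute differences is offered only as one convenient summary; the text itself leaves the judging of fit to experience.

```python
# Hypothetical original and adjusted values for six consecutive intervals.
original = [2.3, 1.5, 4.9, 2.3, 4.8, 3.5]
adjusted = [2.2, 2.4, 3.6, 3.4, 4.0, 3.7]

# Difference for each rectangle: adjusted value less original value.
differences = [round(a - o, 2) for o, a in zip(original, adjusted)]

print("original  adjusted  difference")
for o, a, d in zip(original, adjusted, differences):
    print(f"{o:8.1f}  {a:8.1f}  {d:+10.1f}")

# The First Rule asks that the total area be unchanged.
total_difference = round(sum(adjusted) - sum(original), 2)
# One possible summary of how far the individual areas were disturbed.
sum_abs = round(sum(abs(d) for d in differences), 2)
print("total difference:", total_difference, " sum of absolute differences:", sum_abs)
```

Of two smoothings which both keep the total unchanged, the one with the smaller individual differences would, other conditions being equal, be preferred.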

Exercises. 

16.  Discuss  the  goodness  of  fit  of  each  of  the  curves  smoothed  in 
the  preceding  exercises. 

17.  What  is  the  best  estimate  on  the  basis  of  the  data  of  page  25 
of  the  Top  Beef  Cattle  Prices  for  the  first  week  in  February,  1916? 

Note that in this data the rectangles have no special statistical significance.

18.  From  the  data  of  Exercise  23  of  the  preceding  chapter  what  is 
the  best  estimate  of  the  bank  clearings  in  the  United  States  for  the  first 
half  of  the  year  1908? 

19. What is the significance of the rectangles in the case of the
data of Exercises 14, 15, 16 of Chapter I?

20. In drawing the curves of Exercise 19 should the values be adjusted?
Are these curves drawn by a process of smoothing?

Determining the General Trend of the Data. The characteristics
of a movement over a number of years can be determined
from the smoothed curve. Thus the general upward
trend  of  prices  during  the  years  1897  to  1917  is  shown  by  the 
rise  of  the  curve. 

Perhaps the best way to picture a general movement in the
data is to draw a straight line, or more than one straight line
where there seems to be more than one distinct movement, to fit
the data; that is, to smooth the data with a straight line. With
data not conforming closely to a straight line there is likely to
be some uncertainty in the exact location of the straight line or
lines, but since the lines are but the pictures of the ideas of general
increases or decreases the uncertainty is neither greater nor
less than is the uncertainty in the ideas of the general movements
themselves. The difficulty, in reality, is due to a lack of information
regarding the data. The methods of Chapter X are
of  much  service  in  this  connection. 
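
One definite way of locating such a straight line is the method of least squares. This is an assumption made here for illustration; the text leaves the placing of the line to judgment and to the methods of Chapter X. The figures used are the January quotations for 1911 to 1916 from the table of Chicago Top Beef Cattle Prices given in Chapter III, and `fit_line` is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Least-squares straight line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

years = [1911, 1912, 1913, 1914, 1915, 1916]
january_prices = [7.10, 8.75, 9.50, 9.50, 9.70, 9.85]  # dollars

slope, intercept = fit_line(years, january_prices)
print("average rise per year, in dollars:", round(slope, 2))
```

A positive slope pictures a general upward movement; a slope near zero would indicate no appreciable increase or decrease over the period.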

Exercises. 

21. During the last 37 years has there been an appreciable increase
or decrease in the precipitation at the Columbus Station?

22. During the same time has there been a decided upward or
downward movement in temperatures at the same place?



Periodic Data. In smoothing and determining the general
trend of data care must be taken that the data is not
smoothed  to  conform  to  a  straight  line  when  there  is  an  inherent 
periodicity  in  the  material.  The  data  of  Exercises  23  and  24  of 
Chapter  I  exhibit  significant  tendencies  for  the  values  to  be  high 
for  a  few  years  and  then  consistently  lower  for  a  few  years  and 
then  higher,  and  so  on,  thru  more  or  less  regular  and  uniform 
cycles.  In  smoothing  such  data  the  ideal  should  be  to  determine 
a  uniform  cycle  and  then  smooth  the  data  into  the  curve  made 
up  of  the  determined  cycles.  The  problem  of  smoothing  such 
data  is  complicated  by  the  fact  that  the  curve  in  addition  to  being 
composed of a series of similar loops or arches also has a tendency
to rise or fall. Thus the imports of the U. S. have increased
on the whole during the last 50 years tho there have been
increases  and  decreases  following  each  other  in  fairly  regular 
periods. 

Exercises. 

23.  Smooth  the  data  of  Bank  Clearings  as  given  in  Exercise  23  of 
the  preceding  chapter. 

24. Smooth the data of Imports as given in Exercise 24 of the preceding
chapter.

25. To what extent has there been a tendency for bank clearings
and for imports to increase during the period covered by the given
data?

26. Discuss the periods in the yield per acre of wheat in the U. S.

27.  Do  the  same  for  the  production  of  wheat. 

28.  Summarize   the   uses   and    advantages   of   the   smooth   curve   as 
compared  with  the  curve  which  passes  exactly  thru  each  point. 


CHAPTER  III. 
FREQUENCY  CURVES. 

Definitions.  The  following  data  of  the  measures  of 
heights  of  750  students*  may  be  taken  for  purpose  of  illustration. 

The measurements are classified to show the number of individuals
for each inch of height.

Height.   Number.      Height.   Number.
  61          2          68        126
  62         10          69        109
  63         11          70         87
  64         38          71         75
  65         57          72         23
  66         93          73          9
  67        106          74          4
                                  ----
                       Total       750

TABLE I.

Height, the attribute or characteristic here under consideration,
is in this table measured to the nearest inch, giving
a group or class interval of one inch. A class interval or class
is ordinarily designated by the value of its middle measurement,
and the class limits are located on either side at a half
unit's distance from this mid-value. All individuals, for instance,
with height between 67.5 and 68.5 belong to class 68; here
the limits are 67.5 and 68.5 and the class is designated by the
number 68. Instead of using 61, 62, 63, etc., as class numbers,
the classes may be simply numbered 1, 2, 3, etc., and these
numbers used as class numbers. Again, the classes may be
numbered in both ways from some point within the range,
as 68. This would give class numbers as follows: -7, -6,
-5, -4, -3, -2, -1, 0, +1, +2, etc.

The  objects  measured  or  enumerated  are  referred  to  as 
variates  or  simply  as  individuals. 
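
The class limits and the three ways of numbering the classes described above can be tabulated mechanically. The sketch is illustrative only; in particular, the placing of a height that falls exactly on a class limit in the upper class is an assumption, since the text does not say how such a case is treated.

```python
import math

class_marks = list(range(61, 75))  # mid-values 61, 62, ..., 74
origin = 68                        # point within the range taken as zero

print("class  limits         serial  from 68")
for serial, mark in enumerate(class_marks, start=1):
    lower, upper = mark - 0.5, mark + 0.5
    print(f"{mark:5d}  {lower:5.1f}-{upper:5.1f}  {serial:6d}  {mark - origin:+7d}")

def class_of(height):
    """Class mark for a height in inches (assumption: a height lying
    exactly on a limit is placed in the upper class)."""
    return math.floor(height + 0.5)

print(class_of(67.6), class_of(68.4))
```

Numbering from the origin 68 gives small positive and negative class numbers, which shortens arithmetic done by hand.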


* Records of physical measurements at Ohio State University Gymnasium,
Freshman class, 1913.

The size or frequency of a class is the number of individuals
within that class, and the total frequency is the sum
of  all  the  class  frequencies.  The  table  as  a  whole  constitutes 
a  frequency  distribution  of  height,  and  shows  the  number  of 
times  each  class  occurs. 
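
As a check on these definitions, the class frequencies of Table I may be summed. The dictionary below merely restates Table I; the names `table_1`, `total_frequency` and `largest_class` are introduced for the sketch.

```python
# Table I: class mark (height in inches) -> class frequency.
table_1 = {61: 2, 62: 10, 63: 11, 64: 38, 65: 57, 66: 93, 67: 106,
           68: 126, 69: 109, 70: 87, 71: 75, 72: 23, 73: 9, 74: 4}

# The total frequency is the sum of all the class frequencies.
total_frequency = sum(table_1.values())

# The class of greatest frequency.
largest_class = max(table_1, key=table_1.get)

print(total_frequency, largest_class)
```

The sum agrees with the 750 students stated at the head of the table.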

To illustrate the method of constructing a frequency distribution
let us take the following data: *

Chicago Monthly Top Beef Cattle Prices.

Year.    Jan.    Feb.    Mar.    Apr.    May.   June.   July.    Aug.   Sept.    Oct.    Nov.    Dec.
1916    $9.85   $9.75  $10.05  $10.00  $10.90  $11.50  $11.30  $11.50  $11.50  $11.60  $12.40  $13.00
1915     9.70    9.50    9.15    8.90    9.65    9.95   10.40   10.50   10.50   10.60   10.55   11.60
1914     9.50    9.75    9.75    9.55    9.60    9.45   10.00   10.90   11.05   11.00   11.00   11.40
1913     9.50    9.25    9.30    9.25    9.10    9.20    9.20    9.25    9.50    9.75    9.85   10.25
1912     8.75    9.00    8.85    9.00    9.40    9.60    9.85   10.65   11.00   11.05   11.00   11.25
1911     7.10    7.05    7.35    7.10    6.50    6.75    7.35    8.20    8.25    9.00    9.25    9.35
1910     8.40    8.10    8.85    8.65    8.75    8.85    8.60    8.50    8.50    8.00    7.75    7.55
1909     7.50    7.15    7.40    7.15    7.30    7.50    7.65    8.00    8.50    9.10    9.25    9.50
1908     6.40    6.25    7.50    7.40    7.40    8.40    8.25    7.90    7.85    7.65    8.00    8.00
1907     7.30    7.25    6.90    6.75    6.50    7.10    7.50    7.60    7.35    7.45    7.25    6.35
1906     6.50    6.40    6.35    6.35    6.20    6.10    6.50    6.85    6.95    7.30    7.40    7.90
1905     6.35    6.45    6.35    7.00    6.85    6.35    6.25    6.50    6.50    6.40    6.75    7.00
1904     5.90    6.00    5.80    5.80    5.90    6.70    6.65    6.40    6.55    7.00    7.30    7.65
1903     6.85    6.15    5.75    5.80    5.65    5.15    5.65    6.10    6.15    6.00    5.85    6.00
1902     7.75    7.35    7.40    7.50    7.70    8.50    8.85    9.00    8.85    8.75    7.40    7.75
1901     6.15    6.00    6.25    6.00    6.10    6.55    6.40    6.40    6.60    6.90    7.25    8.00
1900     6.60    6.10    6.05    6.00    5.85    5.90    5.85    6.20    6.15    6.00    6.00    7.50
1899     6.30    6.25    5.90    5.85    5.75    5.75    6.00    6.65    6.90    7.00    7.15    8.25
1898     5.50    5.85    5.80    5.50    5.50    5.35    5.65    5.75    5.85    5.90    6.25    6.25
1897     5.50    5.40    5.65    5.50    5.45    5.30    5.25    5.50    6.00    5.40    6.00    5.65
1896     5.00    4.75    4.75    4.75    4.55    4.65    4.60    5.00    5.30    5.30    5.45    6.50
1895     5.80    5.80    6.60    6.40    5.25    6.00    6.00    6.00    6.00    5.60    5.00    5.50

The  width  of  the  classes  must  be  first  determined.  It 
would  be  possible  to  have  a  class  for  each  quotation  but  it 
would  be  found  highly  inconvenient.  The  error  introduced  by 
the  grouping  of  the  measurements,  the  quotations  in  this  case, 
is ordinarily of no practical significance. A general rule in
determining the width of the classes, and hence of the number of
classes, is to make as wide classes as is practically feasible;
the number of classes is perhaps most often from ten to twenty.
In  this  case  the  width  is  taken  as  fifty  cents  and  the  limiting 
quotations  of  each  class  are  included  in  the  class. 

* Yearbook, Chicago Live Stock World, 1917.

The data is examined and a score made for each occurrence
of the class. Thus Class I with the range 450-499 appears Feb.,
Mar., Apr., May, June, July, 1896; as an occurrence is observed
a mark or score is made, six in all. After the scoring is completed
the frequency of each class, that is, the number of tallies
or  scores  for  each  class,  is  noted  and  written  in  a  column. 
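
The scoring can be imitated for a single year. The sketch below tallies the twelve quotations of 1896, taken from the table of Top Beef Cattle Prices above and written in cents, into classes fifty cents wide beginning at 450. The helper `class_range` and the constants are assumptions made for the illustration.

```python
from collections import Counter

# Quotations for 1896, January to December, in cents.
quotes_1896 = [500, 475, 475, 475, 455, 465, 460, 500, 530, 530, 545, 650]

WIDTH = 50     # class width of fifty cents
LOWEST = 450   # lower limit of Class I

def class_range(cents):
    """Limits (in cents) of the class containing a quotation; both
    limiting quotations are included in the class."""
    lower = LOWEST + WIDTH * ((cents - LOWEST) // WIDTH)
    return (lower, lower + WIDTH - 1)

# One score for each occurrence of a class.
tally = Counter(class_range(q) for q in quotes_1896)

for limits in sorted(tally):
    print(limits, tally[limits])
```

Class I, with limits 450 and 499, receives six scores from this year, in agreement with the count given in the text; carrying the same tally over all the years of the table gives the full frequency distribution.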

The frequency distribution just obtained shows the number
of times each price-class has occurred during the last twenty-one
years.

Exercises. 

1.  Construct   a   frequency  table   from   the   Top   Beef    Cattle   Prices 
using class intervals of twenty-five cents and compare with the distribution
obtained when the class interval is fifty cents.

2.  From  the  following  table  of  Mean  Monthly  Temperatures  at  the 
Columbus  Station  construct  a  frequency  table  with  a  class  width  of  five 
degrees. 


Year.   Jan.   Feb.   Mar.   Apr.   May.  June.  July.   Aug.  Sept.   Oct.   Nov.   Dec.
1878    ....   ....   ....   ....   ....   ....   78.6   74.0   65.9   53.5   43.1   26.2
1879    25.6   28.9   41.6   50.3   64.1   71.4   78.4   71.2   61.4   62.4   43.8   37.2
1880    43.8   38.8   40.8   53.8   68.8   72.8   75.1   74.2   65.4   52.2   32.8   24.6
1881    24.2   29.2   36.8   47.6   67.8   69.7   78.6   75.8   74.6   60.5   44.4   40.6
1882    32.6   41.8   44.8   51.0   57.4   69.9   71.8   72.2   66.2   59.0   42.6   32.0
1883    27.1   34.3   35.3   51.0   60.3   70.7   74.1   70.2   64.0   55.6   44.7   34.8
1884    20.0   37.2   39.2   49.4   61.7   72.9   73.8   72.8   71.0   58.9   41.4   32.2
1885    22.9   19.4   33.1   50.0   61.2   69.0   76.6   71.1   64.0   61.4   40.9   32.4
1886    23.4   26.8   39.0   54.7   62.8   67.3   72.2   71.6   65.8   54.4   38.8   27.2
1887    26.8   36.4   37.3   50.8   67.6   71.2   79.6   70.8   64.0   51.4   41.7   32.8
1888    26.8   33.0   36.5   51.2   60.6   71.6   73.2   71.4   61.3   48.7   44.1   34.2
1889    34.2   26.4   42.2   61.8   61.4   67.7   74.1   70.2   63.8   49.0   41.2   44.6
1890    39.1   40.6   35.2   52.3   60.0   74.6   73.6   70.2   63.1   53.8   44.6   31.8
1891    33.0   36.8   34.8   52.9   57.6   72.4   70.0   71.0   69.4   52.8   40.3   40.0
1892    24.0   35.7   36.0   49.4   61.0   74.2   74.0   73.0   65.4   53.6   38.2   30.0
1893    18.8   30.8   39.5   51.6   59.4   71.4   76.4   71.8   66.7   55.1   40.0   33.0
1894    34.7   29.4   46.2   51.5   60.6   72.4   75.2   72.2   69.1   54.8   39.0   34.6
1895    24.1   21.0   36.7   53.2   62.8   74.9   73.8   75.6   71.2   48.2   42.4   34.9
1896    30.8   31.8   33.7   58.6   69.7   70.8   74.4   72.9   63.2   50.4   45.2   36.4
1897    26.4   34.0   43.1   50.6   57.9   69.8   77.2   71.1   68.8   59.8   42.7   33.7
1898    33.2   31.5   46.3   48.6   62.7   73.8   77.7   75.0   69.8   55.0   39.6   30.0
1899    29.4   22.8   38.3   55.2   65.4   73.4   76.2   75.8   65.7   59.2   45.3   31.1
1900    32.8   26.8   34.5   51.8   65.4   71.9   76.2   78.5   71.2   62.2   42.4   32.5
1901    30.4   23.5   40.6   48.5   61.1   73.4   79.9   74.6   66.6   55.7   38.6   28.7
1902    28.8   23.2   42.7   50.1   65.0   68.2   75.2   70.6   65.2   56.2   49.8   30.7
1903    28.3   31.8   47.8   51.2   65.8   66.0   74.5   73.0   67.6   55.4   38.4   24.6
1904    22.8   24.8   40.9   45.0   62.1   69.6   73.4   71.0   67.1   54.0   42.4   29.0
1905    24.0   22.0   44.4   50.4   62.3   70.8   74.9   73.3   67.1   53.7   40.6   34.2
1906    36.6   28.4   32.0   54.7   62.8   71.0   73.2   75.7   70.0   53.0   42.4   32.7
1907    33.5   27.2   46.8   43.0   55.7   67.2   73.9   71.0   66.7   50.0   40.2   34.6
1908    30.0   29.6   44.5   51.7   64.0   70.6   75.6   72.7   70.4   55.6   42.8   34.3
1909    32.8   36.2   38.1   50.1   59.8   71.8   72.0   73.4   64.1   49.9   49.6   25.8
1910    58.2   26.2   50.1   52.6   57.1   68.4   75.2   73.4   67.6   57.8   37.0   26.5
1911    33.5   35.4   38.3   49.0   68.6   72.8   75.7   73.9   68.8   54.1   38.2   37.4
1912    19.2   23.4   34.2   43.4   64.2   68.2   74.9   70.7   68.0   56.2   42.6   34.4
1913    36.8   27.2   40.8   50.4   61.7   71.8   76.5   75.4   65.4   54.4   45.8   35.5
1914    34.0   23.6   36.5   50.7   63.8   72.6   75.9   74.0   64.8   57.7   42.9   27.6
1915    27.8   36.0   34.0   56.3   58.2   68.0   73.0   68.2   68.0   56.7   44.4   31.0
1916    36.0   26.8   35.5   49.9   62.6   65.9   78.6   75.8   63.9   55.2   43.4   30.4

3.  Construct  the  frequency  table  of  the  following  data  of  monthly 
precipitation  at  the  Columbus  Station.  Take  one-half  inch  for  the  class 
width. 


Year.   Jan.   Feb.   Mar.   Apr.   May.  June.  July.   Aug.  Sept.   Oct.   Nov.   Dec.
1878    ....   ....   ....   ....   ....   ....   3.58   5.00   2.84   3.17   3.06   3.88
1879    1.66   1.43   3.77   0.92   2.09   2.68   3.67   4.64   2.33   0.23   3.52   4.29
1880    4.49   1.70   2.42   5.08   3.21   3.30   4.86   6.95   1.80   2.35   4.54   3.98
1881    2.25   4.44   4.01   2.04   2.00   4.02   5.33   2.09   1.54   8.64   5.35   5.23
1882    4.69   5.94   4.76   4.87   9.59   6.01   2.62   3.14   2.91   2.44   2.05   2.23
1883    3.20   6.18   3.20   2.85   6.38   4.25   3.75   2.54   2.43   6.11   3.87   4.12
1884    2.25   4.95   3.59   2.11   3.79   2.59   2.16   0.70   3.46   1.66   0.99   2.77
1885    3.75   2.39   0.53   4.61   5.83   5.08   3.28   5.90   2.84   3.11   3.08   1.85
1886    4.36   1.26   3.90   3.57   7.67   2.69   4.17   2.44   3.61   1.13   4.18   3.41
1887    2.35   6.48   2.56   3.44   2.97   2.82   1.45   2.21   1.35   0.30   2.45   1.87
1888    3.73   1.30   3.79   1.53   3.89   1.62   5.81   4.34   0.91   3.77   3.26   1.11
1889    3.37   1.06   0.66   0.83   3.92   2.77   2.94   1.59   3.34   1.83   3.83   2.36
1890    5.73   6.12   5.63   4.32   5.12   4.95   1.80   2.75   7.13   3.02   1.97   2.19
1891    2.84   5.42   4.64   2.25   2.73   4.98   4.69   2.64   1.05   2.94   5.44   2.42
1892    2.21   3.35   2.23   2.67   3.58   4.96   3.31   5.12   1.47   0.84   2.20   1.60
1893    2.25   7.63   1.92   7.68   4.81   2.89   1.27   1.65   1.14   3.33   2.16   1.97
1894    2.42   3.11   1.79   1.79   2.78   1.12   1.74   2.64   5.31   1.93   1.91   2.95
1895    4.67   0.64   1.23   4.12   1.73   2.94   1.45   2.10   1.48   0.92   5.32   4.14
1896    2.34   1.93   3.04   2.70   2.61   3.38   9.47   3.53   5.93   0.55   3.53   1.52
1897    1.54   3.71   5.45   4.27   3.68   2.45   6.95   1.95   0.82   0.36   7.54   2.43
1898    5.29   1.67   7.03   2.05   6.04   1.63   2.33   7.16   1.77   2.95   2.30   1.09
1899    2.35   1.44   4.69   1.18   2.25   1.26   4.85   1.49   2.01   2.23   1.72   2.98
1900    3.01   3.30   2.59   1.76   1.82   2.45   3.89   3.02   0.97   2.86   3.71   0.92
1901    1.50   0.88   1.82   2.21   4.24   6.31   1.23   1.71   2.10   0.33   0.59   3.61
1902    1.56   0.51   2.63   1.60   0.95   8.52   4.70   1.62   4.16   1.85   2.72   3.41
1903    2.11   4.44   4.13   2.47   2.18   3.07   2.05   0.67   1.46   1.84   2.01   1.71
1904    2.80   8.12   4.93   2.49   4.01   3.86   2.48   3.18   0.83   0.97   0.18   3.63
1905    1.25   1.57   5.87   3.15   4.38   2.78   2.27   5.45   3.36   5.45   1.64   1.87
1906    1.98   1.08   4.59   1.16   2.47   1.44   5.27   6.15   1.59   2.07   2.57   3.33
1907    5.73   0.43   5.21   3.27   3.35   3.39   6.07   2.47   2.27   1.59   1.68   1.85
1908    1.40   3.66   6.03   2.75   4.04   2.13   3.74   2.34   0.42   1.20   0.84   1.59
1909    2.52   4.97   2.68   3.20   4.65   3.88   3.34   2.53   1.81   2.77   1.66   2.58
1910    5.11   5.05   0.28   2.52   4.10   2.93   2.40   0.42   3.66   5.22   0.79   2.31
1911    4.46   1.71   2.36   4.37   1.15   4.04   3.29   3.62   5.98   5.21   2.71   4.53
1912    1.58   1.53   4.56   4.20   2.65   1.48   3.50   2.25   2.83   1.71   1.01   2.34
1913    6.63   2.09   8.09   3.91   2.60   1.56   2.88   2.10   3.28   2.05   4.56   1.13
1914    2.21   3.70   2.46   2.48   1.28   2.03   1.61   4.78   1.26   4.44   1.99   2.91
1915    3.30   1.52   1.19   0.95   2.57   5.06   6.85   7.01   4.43   0.94   1.97   4.15
1916    5.02   1.47   4.88   2.33   4.81   3.49   0.65   3.22   1.54   1.84   1.58   3.59

28  INTRODUCTION    TO    MATHEMATICAL    STATISTICS 

4.  Study  the  frequency  distributions  of  population  with  respect  to 
age;   Report  of  the  Thirteenth   Census,   1910,   Chapter   IV,  Vol.   1,  with 
special  reference  to  the  size  of  the  various  class  intervals  and  note  two 
general   forms   of   stating  the   frequencies   of  the   classes. 

5.  Examine  the  different  forms  of  frequency  distributions  appearing 
in  the  report  of  the  Medico-Actuarial   Society's   Investigations,   Vols.   I, 
II,  III,  IV;  also  in  Biometrika,  Agricultural  Experiment  Station  Bulletins 
and  in  other  accessible  sources. 

6.  In which of the exercises of Chapter I is the data in the
frequency distribution form?

Plotting a Frequency Distribution. The illustrative data
at the beginning of this chapter is plotted by locating 14
equidistant points on a horizontal line, one for each height class
from 61 inches to 74 inches inclusive. Then at the middle of each
interval so obtained a vertical line is erected with a height
proportional to the corresponding class frequency. In this way a
point is obtained for each class.

As in Chapter II, a rectangle is constructed on each
interval. It must be apparent that the area of a rectangle in the
case of the frequency distribution has in every case a significant
statistical meaning: taking the class interval as the unit of
width, it is the frequency of the class. Hence the sum of the areas
of all the rectangles is the total frequency of the distribution.
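
The equality of the summed rectangle areas and the total frequency
may be checked numerically. The following is only a minimal sketch in
Python, assuming the student-height classes and frequencies tabulated
in Table II of Chapter IV and a class width of one inch; the variable
names are illustrative and form no part of the original text.

```python
# Height classes (inches) and class frequencies, as tabulated in Table II.
classes = list(range(61, 75))
frequencies = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]

class_width = 1  # assumed: each class interval is one inch wide

# Each rectangle stands on its class interval with height equal to the
# class frequency, so its area is width times frequency.
areas = [class_width * f for f in frequencies]
total_frequency = sum(areas)

# A rough text picture of the rectangles (one '#' for every 5 individuals).
for c, a in zip(classes, areas):
    print(f"{c:>3} | {'#' * (a // 5)} {a}")
print("total frequency:", total_frequency)
```

With a class width other than one the areas would be proportional to,
rather than equal to, the class frequencies.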

Smoothing the Frequency Curve. With the rectangles
drawn, the smoothing of a frequency distribution is in no wise
different from the smoothing of the data discussed in the
preceding chapter. However, for the frequency curve the two rules
of the permanence of areas have a stronger justification because
of the more definite significance of the areas under the curve.

With practice in the construction of statistical diagrams
and curves the rectangles may be dispensed with and the
curve drawn by inspection, especially when the data contains
a large element of uncertainty. Also the broken line obtained
by joining the ends of the ordinates, called the frequency
polygon, may be smoothed by inspection into the required curve.

Exercises. 

7.  Smooth  the  illustrative  data  at  the  beginning  of  this  chapter. 

8.  Smooth  the  frequency  distribution  of  Chicago  Top  Beef  Cattle 
Prices  for  50-cent  intervals. 

9.  Tabulate    the    same    data    to    show    the    distribution    for    25-cent 
classes. 


10.  Construct  the  smoothed  frequency  curve  for  the  distribution  of 
temperature  and  of  precipitation  at   Columbus  since  1879. 

11.  From data obtained from a financial paper construct the
frequency distribution of the prices of preferred stocks for any one
market day.

12.  Do  the  same  for  a  common  stock. 

13.  Draw  the  smoothed  curve  of  the  following  weight  distribution  of 
Ohio  State  University  freshmen. 

Weight Class   102  107  112  117  122  127  132  137  142  147  152
Frequency        8   13   20   48   76   93   93  110   93   49   56

Weight Class   157  162  167  172  177  182  187
Frequency       31   22   13   11    3    2    9

The  weight  classes  are  here  of  width  five  pounds  and  the  middle 
value  of  each  class  is  taken  as  the  class  number.  Class  187  includes  all 
persons  with  weight  greater  than  184. 

14.  Construct the smooth curve of the distribution of ages of
graduates from the Columbus Public Schools.

Ages       11   12   13   14   15   16   17   18
Numbers     0    7   45  186  114   61    8    0

13.  Construct  the  frequency  curve  of  the  preferred  stock  data  of 
Exercise  11. 

14.  Do  the  same  for  the  common  stock  data  of  Exercise  12. 

Use of the Frequency Curve. The frequency curve does
not give a chronological picture of the variations in the data.
Instead it shows the number of times that each value occurs.
The frequency curve of precipitation for a drier climate is
located to the left of that for a more moist climate because
months with small precipitation occur more frequently in the
drier region. The frequency curve of higher prices lies farther
to the right than does that of lower prices, so that by
constructing the frequency curves it can be readily discovered which
series of prices tends to be higher.

Exercises. 

15.  Compare   the   Top   Beef   Cattle    Prices   of   1895  with   those  of 
1915. 

16.  Compare  the  precipitation  at  Columbus  with  that  of  some  other 
station. 

Typical or Representative Data. Statistical data may
be collected for the express purpose of exhibiting a chronological
or other statement of the variations. This sort of data is usually
based on the complete enumeration of a given set of objects, as
the census of population to apportion the members of the House
of Representatives, or measures of stature for military purposes.

In discussing an increase in prices it is impossible to quote
all prices; recourse must be had to a carefully selected list of
prices. The condition of trade in certain industries is taken as
indicative of the condition of all business. In comparing the
prices of beef and the prices of corn the real object of
investigation is to find an underlying connection between the two
series of values, a connection which will hold good in any particular
year. In such a study the historical statistics of the two price
variations are in reality used as representative, as typical, of
the manner in which the two prices are related. It is apparent
that the frequency form of distribution is peculiarly adapted to
typical data.

The Errors in Representative Data. The theory of
enumerative statistics is simple in statement; the chief cares
of the statistician are that all objects are counted and none
counted more than once, and that an adequate and effective
method of presentation is adopted. There are also complicated
questions of the methods of collecting the data and of the limits
of accuracy of the data, but these are met with in data of either
form.

Because it is practically impossible to secure homogeneous
data (that is, data in which the values for all characteristics
except those under consideration are the same for all variates),
representative data must be examined for homogeneity. For
instance, the persons whose heights are tabulated at the beginning
of this chapter differ in age, early environment and physical
condition, as well as in height, so that the given distribution is
in reality a distribution of a complex of attributes instead of
merely the one attribute, height. Unless the influence of these
various factors is carefully studied, serious errors may result
from attempting to apply to another distribution the conclusions
drawn from this distribution.

It may be shown that, even from absolutely homogeneous
material, successive samples made up strictly at random, that is,
without bias or prejudice, will most likely give materially
differing distributions. The extent of such errors must be
understood in reasoning from one distribution to another.

Hence in working with typical or representative data care
must be taken regarding (1) the limits of accuracy of the data;
(2) the homogeneity of the data; (3) the errors of random
sampling.
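
The errors of random sampling may be illustrated by drawing two
samples from the same material and tabulating each. What follows is
a sketch only: it assumes, purely for illustration, that the
"material" is a population built from the student-height frequencies
of Table II, that samples of 100 are drawn with replacement, and that
a fixed seed is used so the run can be repeated. None of these choices
comes from the text.

```python
import random
from collections import Counter

# Assumed homogeneous material: one entry per student of the height table.
classes = list(range(61, 75))
frequencies = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]
population = [c for c, f in zip(classes, frequencies) for _ in range(f)]

rng = random.Random(1918)  # fixed seed, chosen arbitrarily

def sample_distribution(size):
    """Draw a random sample with replacement and tabulate its class frequencies."""
    sample = [rng.choice(population) for _ in range(size)]
    counts = Counter(sample)
    return [counts.get(c, 0) for c in classes]

first = sample_distribution(100)
second = sample_distribution(100)

print("class :", classes)
print("first :", first)
print("second:", second)
```

Comparing the two printed rows class by class gives a direct
impression of how much two strictly random samples from the same
material may differ.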


CHAPTER  IV. 

The Arithmetic Mean. Let us add the January prices
in the data of page 25, and then divide the sum by the number
of items. The result is $8.33. In this way a number, the
arithmetic mean, is obtained. The characteristic arithmetic property
of this number is that each of the given data values may be
replaced by it without altering the total sum of all the values.
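
The characteristic property just stated can be verified on any small
set of numbers. The sketch below uses four hypothetical prices, which
are not the January prices of page 25, and exact fractions so that the
two sums can be compared without rounding.

```python
from fractions import Fraction

# Hypothetical prices in dollars, chosen only for illustration.
prices = [Fraction("6.00"), Fraction("7.50"), Fraction("9.00"), Fraction("8.30")]

# Arithmetic mean: the sum divided by the number of items.
mean = sum(prices) / len(prices)

# Replace every value by the mean; the total sum is unaltered.
replaced = [mean] * len(prices)

print("mean:", float(mean))
print("sum unchanged:", sum(replaced) == sum(prices))
```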

It is usual to speak of the arithmetic mean simply as the
mean unless, in order to distinguish the arithmetic mean from
some other mean, there is special need for the defining word
"arithmetic."

Exercises. 

1.  Determine from the data of Exercise 1, Chapter I, the arithmetic
mean of the monthly rainfall at Columbus, for March.

2.  Determine  from  the  data  of  Exercise  5,  Chapter  I,  the  arithmetic 
mean  of  the  annual  precipitation  at  Columbus. 

3.  Find from the data of page 25 the arithmetic mean of the 1895
Top Beef Cattle prices and compare with the 1915 mean.

4.  On   the   assumption   that  the   population    of   the   United    States 
increased   uniformly   from   1900   to   1910   find   the    value   of   the   annual 
increase  and  then  the  estimated  population  for  1906. 

5.  Compute  the  arithmetic  mean  of  the  Monthly  Top  Beef  Cattle 
prices  for  the  years  1895  to  1916. 

6.  By  first  assigning  each  monthly  price  to  the  appropriate  50-cent 
class  as  on  page  25   and  computing  the  arithmetic  mean  of  the   prices 
when   so   altered    determine   the   effect   on   the   value   of   the    arithmetic 
mean   of    substituting   the   class   prices    for   the   exact  values.     Use   the 
class  numbers  in  the  computation  and  translate  the  result  in  terms   of 
the  proper  interval. 

7.  In   Exercise   5  there   are  264   entries   in   the   sum   to  be  added. 
Show  that  much  of  the  labor  of  the  addition  can  be  avoided  by  selecting 
the  equal  prices,  then  multiplying  each  by  the  number  of  times  it  occurs, 
and  adding  the  resulting  products  to  obtain  the  total  sum  of  prices. 

The results of Exercises 5, 6 and 7 suggest the computing
of the mean from a frequency table in accordance with the
following rule: multiply each deviation by its frequency, add the
resulting products, and divide this total sum by the total
frequency. The quotient is the value of the mean. Thus, from the
frequency distribution of Top Beef Cattle Prices of Chapter III,
obtained on page 26 (6, 13, 35, 43, 30, 30, 18, 12, 15, 16, 17, 3,
6, 7, 1), the mean price is given by the expression

$$
d = \frac{1\times 6 + 2\times 13 + 3\times 35 + 4\times 43 + 5\times 30 + 6\times 30 + 7\times 18 + 8\times 12 + 9\times 15 + 10\times 16 + 11\times 17 + 12\times 3 + 13\times 6 + 14\times 7 + 15\times 1}{252} = 6.00,
$$

where d is the distance of the computed mean from the origin.

The mean class is thus 6.00; that is, 6 of the 50-cent
intervals. This gives 7.25, the mid-value of class 6, as the mean
price.

Whenever the frequency table is available the method just
described is usually the shortest method for computing the value
of the mean. However, if the frequency distribution is not
needed for any other purpose, and especially if an adding
machine is at hand, the saving of time in the computation of the
mean does not ordinarily justify the compilation of a frequency
table merely for the one purpose of finding the mean.

The following is the computation for mean height from
the data at the beginning of Chapter III.

Let us take the origin at height 60. Then the computation
scheme will be as follows:


Computation of the Mean.

Class.   Deviation.   Frequency.   Dev. times Freq.
  61          1             2                2
  62          2            10               20
  63          3            11               33
  64          4            38              152
  65          5            57              285
  66          6            93              558
  67          7           106              742
  68          8           126            1,008
  69          9           109              981
  70         10            87              870
  71         11            75              825
  72         12            23              276
  73         13             9              117
  74         14             4               56
                          ---            -----
                          750            5,925

$$ d = \frac{5{,}925}{750} = 7.9 $$

TABLE II.


Hence the mean height is 7.9 classes; that is, 7.9 inches from
the origin, and is therefore equal to 67.9 inches.
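
The computation scheme of Table II translates directly into a few
lines of code. The following is a minimal sketch, with the origin
taken at height 60 as in the text; the function name
mean_from_frequency_table is hypothetical and is introduced here only
for illustration.

```python
# Classes and frequencies of the student-height distribution (Table II).
classes = list(range(61, 75))
frequencies = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]
origin = 60  # the origin chosen in the text

def mean_from_frequency_table(values, freqs, origin=0):
    """Multiply each deviation by its frequency, add the products,
    and divide the total by the total frequency."""
    total = sum(freqs)
    weighted = sum((v - origin) * f for v, f in zip(values, freqs))
    return origin + weighted / total, weighted, total

mean_height, weighted_sum, total_frequency = mean_from_frequency_table(
    classes, frequencies, origin
)
print(weighted_sum, total_frequency, mean_height)
```

Any other origin gives the same mean height; the choice of 60 merely
keeps the deviations small and positive.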

Statistical Properties of the Arithmetic Mean. What is
the statistical significance and interpretation of the arithmetic
mean? If a higher price were substituted for one of the January
beef cattle prices the resulting arithmetic mean would be larger,
but not so much larger as the individual price because in the
process of obtaining the mean the price increase is divided by
the total number of prices. Hence a larger mean denotes that,
as a whole, the values of the distribution are greater, and a
smaller arithmetic mean is to be interpreted as indicating a
relatively lower series of values. And since all increases and
decreases are to be divided by the number of variates the
changes in the value of the arithmetic mean are relatively
smaller than those of the individual values. Thus a decrease
of 50 cents must occur in each of the above prices in order to
decrease the arithmetic mean by the same amount. A decrease
of 50 cents in one-half the variates decreases the arithmetic mean
by only 25 cents, and so on. That is, the arithmetic mean is
relatively more stable than is an individual measurement. Thus
if several groups of 750 students were measured for height and
the frequency distribution tabulated and the means computed
for each group it would be found that the means would differ
but little while the frequency of any one class, 67 inches for
instance, would vary considerably from distribution to
distribution.

It is to be noted that a single increase of 50 cents in the
price for one month has exactly the same effect on the value of
the arithmetic mean as does a 10-cent increase in the prices of
each of five months. But is this true statistically? Should the
exceptionally high price be given so much weight? Should the
person of exceptional height be emphasized so strongly in the
group of persons whose height is measured?
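
The arithmetic behind these statements is easily checked. The sketch
below uses six hypothetical prices expressed in cents (they are not
the book's beef cattle prices) and compares the mean before and after
each kind of change described above.

```python
def mean(values):
    """Arithmetic mean: the sum divided by the number of items."""
    return sum(values) / len(values)

# Hypothetical prices in cents; these are not the book's data.
prices = [700, 750, 800, 850, 900, 950]
base = mean(prices)

every_price_down_50 = [p - 50 for p in prices]
half_the_prices_down_50 = [p - 50 for p in prices[:3]] + prices[3:]
one_price_up_50 = [prices[0] + 50] + prices[1:]
five_prices_up_10 = [p + 10 for p in prices[:5]] + prices[5:]

# Fall in the mean, in cents, under each kind of decrease.
print(base - mean(every_price_down_50))
print(base - mean(half_the_prices_down_50))
# One 50-cent rise against five 10-cent rises.
print(mean(one_price_up_50) == mean(five_prices_up_10))
```

The mean responds only to the total amount of change, not to the way
in which that change is spread over the variates, which is precisely
the point the questions above call into doubt.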

That is, the value of the mean may not always be significant
because a part of its value may be due to the presence of
unduly large variates. Whether an item is unduly large can be
determined only from a study of the data itself, for the mean
conveys no information whatever as to the distribution of the
variates; it tells only of their general size. That is, the statistical
function of the arithmetic mean is essentially to measure the
size or magnitude of the data as a whole.

Theorem. In any distribution the sum of the deviations
from the mean is zero. That is, the sum of the positive
deviations is equal to the sum of the negative deviations. The
distance of the mean from any origin is obtained by taking
the sum of the deviations from that origin and dividing by
the total frequency; hence when this distance is zero the sum
of the deviations must be zero.
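
The theorem may be verified on the student-height distribution of
Table II. The following sketch uses exact fractions so that the
positive and negative deviations can be compared without rounding
error; the variable names are illustrative only.

```python
from fractions import Fraction

# Classes and frequencies of the student-height distribution (Table II).
classes = list(range(61, 75))
frequencies = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]

total = sum(frequencies)
mean = Fraction(sum(c * f for c, f in zip(classes, frequencies)), total)

# Sum of deviations above the mean and below the mean, each weighted
# by the class frequency.
positive = sum((c - mean) * f for c, f in zip(classes, frequencies) if c > mean)
negative = sum((mean - c) * f for c, f in zip(classes, frequencies) if c < mean)

print("mean:", float(mean))
print("positive deviations:", float(positive))
print("negative deviations:", float(negative))
```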

Weighted Arithmetic Mean. An apparent modification
of the arithmetic mean is illustrated by the following: It is
desired to obtain an index of food prices by taking the mean
of the price quotations of 15 articles of food. It is decided,
however, that one of the quotations should be given twice
the weight of the other articles. This is done by multiplying
this quotation by two and taking the double quotation in the
total sum. The article is said to have a weight of two. The
idea of weight introduces no new principles into the computation
of the arithmetic mean.
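
That weighting introduces no new principle may be seen by comparing
a weighted mean with the ordinary mean of a list in which the
weighted item is simply written twice. The sketch below rests on two
assumptions not stated in the text: the three quotations are
hypothetical, and the total is divided by the sum of the weights.

```python
def weighted_mean(values, weights):
    """Sum of value times weight, divided by the sum of the weights (assumed divisor)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical price quotations; the second is given a weight of two.
quotations = [12.0, 30.0, 18.0]
weights = [1, 2, 1]

weighted = weighted_mean(quotations, weights)

# The same result from an ordinary mean with the second quotation entered twice.
doubled = [12.0, 30.0, 30.0, 18.0]
ordinary = sum(doubled) / len(doubled)

print(weighted, ordinary)
```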

Adjustment or Graduation Formulas. A class of
adjustment formulas of wide and convenient adaptability to the
smoothing of data is based on the arithmetic mean.

A series of terms not differing greatly from each other may
be smoothed by replacing each by the mean of the five terms,
for instance, of which the given term is the middle term. The
distribution obtained by the first adjustment may in turn be
similarly smoothed, and indeed the process may be repeated at
pleasure. In this way the various graduation formulas of this
type are built up. Next to the graphic method this is the simplest
method for the smoothing of observations.
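
The five-term adjustment described above may be sketched as follows.
The series is hypothetical, the function name moving_average is
introduced only for illustration, and the treatment of the end terms
(they are simply dropped, since they are not the middle of any five
terms) is an assumption rather than a rule taken from the text.

```python
def moving_average(series, window=5):
    """Replace each interior term by the mean of the `window` terms
    of which it is the middle term; end terms are dropped."""
    if window % 2 == 0:
        raise ValueError("window must be odd so that each term has a middle position")
    half = window // 2
    return [
        sum(series[i - half:i + half + 1]) / window
        for i in range(half, len(series) - half)
    ]

# Hypothetical series of terms not differing greatly from each other.
series = [3, 5, 4, 8, 6, 9, 7, 11, 10]

once = moving_average(series)   # first adjustment
twice = moving_average(once)    # the process repeated on the adjusted series

print(once)
print(twice)
```

Each application shortens the series by four terms, which bears on the
reliability of the method near the ends of the range.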

Extensive application of this method has been made in
the graduating of mortality tables, and under the name of the
method of the moving average it is often used in smoothing
data in which the general trend is obscured by the presence of
more or less regular fluctuations. In this case the number of
classes grouped together should be determined by the lengths of
the cycles of the fluctuations.* If the cycles are irregular in
length the method of the moving average is not likely to yield
satisfactory results.

*  See King, Elements of Statistical Method, sec. 97. Also Quarterly
Publications of the American Statistical Society, Dec., 1915, and March,

Exercises. 

8.  Smooth the data of Table 1, by taking the means of each
successive five terms, then of seven and finally of nine.

9.  Apply  the  method  of  the  moving  average  to  smooth  the  data 
of  the  Top  Beef  Cattle  Prices.     Is  the  method  highly  applicable  to  this 
data? 

10.  Discuss  the  reliability  of  this  method  for  terms  at  the  end  of 
the  range. 

11.  Apply  the  five  term  method  to  the  distribution  of  Ex.  5,  Chap.  I. 

The Geometric Mean. Let the price of a certain article
for each year from 1910 to 1915 be expressed as a percent of
that of the preceding year as follows (assuming 100 for the 1910
price): 100, 105, 118, 109, 102, 115. The ratio of the 1915 price
to the 1910 price is obtained by multiplying together the five
percents and is approximately 1.58. What uniform percent of
increase will give the same ratio of the 1915 price to that of
1910? Let (1 + r) be the constant multiplier or percent. Then
we have

$$ (1 + r)^5 = 1.05 \times 1.18 \times 1.09 \times 1.02 \times 1.15 = 1.58415, $$

and

$$ 1 + r = \sqrt[5]{1.58415} = 1.096. $$

Each of the unequal increases in the series may therefore
be replaced by the percent, 1.096, and still give the same product.
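
The computation above may be checked in a few lines. This is a sketch
only, in which the five yearly percents are written as multipliers
(1.05 for 105 percent, and so on); the variable names are
illustrative.

```python
import math

# Yearly price relatives 1911 to 1915, each as a multiple of the preceding year.
relatives = [1.05, 1.18, 1.09, 1.02, 1.15]

# Ratio of the 1915 price to the 1910 price.
product = math.prod(relatives)

# Uniform multiplier (1 + r): the fifth root of the product.
uniform = product ** (1 / len(relatives))

print(round(product, 5))
print(round(uniform, 3))
```

Raising the uniform multiplier to the fifth power restores the
product, which is the defining property used in the text.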

The population of continental United States in 1910 was
91,972,266; in 1900, 75,994,575. On the assumption of a
uniform rate of increase during the decade what should be the
value of this uniform rate in percent? As above, we have

$$ (1 + r)^{10} = \frac{91{,}972{,}266}{75{,}994{,}575} = 1.21025. $$

Hence

$$ 1 + r = \sqrt[10]{1.21025} = 1.019. $$

It may be noted that according to this method the
population in 1906 was equal to $75{,}994{,}575 \times (1.019)^6$.
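
The same steps may be carried out numerically. The following is a
minimal sketch using the two census figures quoted above; the 1906
estimate it prints uses the unrounded rate and therefore differs
slightly from one computed with the rounded multiplier 1.019.

```python
# Census populations of continental United States, as quoted in the text.
pop_1900 = 75_994_575
pop_1910 = 91_972_266

# Ratio over the decade and the uniform yearly multiplier (1 + r).
ratio = pop_1910 / pop_1900
rate = ratio ** (1 / 10)

# Estimated population six years after 1900 at the uniform rate.
pop_1906 = pop_1900 * rate ** 6

print(round(ratio, 5))
print(round(rate, 3))
print(round(pop_1906))
```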

The problem in the case of the arithmetic mean is to
find a uniform number which, when substituted for each of
the variates, leaves the total sum unchanged. In problems
similar to that just preceding it is a matter of finding a
number which, when substituted for each of the given numbers,
leaves the product of all the numbers unchanged; such a
number is called the geometric mean.

Exercises. 

12.  Compute the geometric mean of the following numbers: 2, 4, 8.

13.  Compare from Exercise 4, the 1906 population on the
assumption of a uniform annual increase with that obtained from the
assumption of a uniform annual rate.

For any but the simplest problems the computation of the
geometric mean cannot be accomplished without the use of
logarithms. The following computation of the geometric mean
of student heights from the data of page 24 illustrates the process.
The geometric mean height is

$$
\left( 61^{2} \cdot 62^{10} \cdot 63^{11} \cdot 64^{38} \cdot 65^{57} \cdot 66^{93} \cdot 67^{106} \cdot 68^{126} \cdot 69^{109} \cdot 70^{87} \cdot 71^{75} \cdot 72^{23} \cdot 73^{9} \cdot 74^{4} \right)^{1/750}
$$

and

$$
\begin{aligned}
750 \log(\text{geo. mean}) ={}& 2 \log 61 + 10 \log 62 + 11 \log 63 + 38 \log 64 + 57 \log 65 \\
& + 93 \log 66 + 106 \log 67 + 126 \log 68 + 109 \log 69 + 87 \log 70 \\
& + 75 \log 71 + 23 \log 72 + 9 \log 73 + 4 \log 74.
\end{aligned}
$$

Hence

$$ \log(\text{geo. mean}) = \frac{1373.70355}{750} = 1.83160 $$

and geo. mean height = 67.86.
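
The logarithmic computation can be reproduced as follows. This is a
sketch under the assumption that common (base 10) logarithms are
meant, as the figure 1.83160 suggests; the variable names are
illustrative.

```python
import math

# Classes and frequencies of the student-height distribution.
classes = list(range(61, 75))
frequencies = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]

total = sum(frequencies)

# Sum of frequency times common logarithm of the class value.
log_sum = sum(f * math.log10(c) for c, f in zip(classes, frequencies))

# The logarithm of the geometric mean is the mean of the logarithms.
log_mean = log_sum / total
geometric_mean = 10 ** log_mean

print(round(log_mean, 5))
print(round(geometric_mean, 2))
```

The geometric mean so obtained falls slightly below the arithmetic
mean of 67.9 inches found earlier from the same table.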

Exercises. 

14.  Compute  the  geometric  mean  of  the  Monthly  Top  Beef  Cattle 
Prices. 

15.  Compute the geometric mean for the March precipitation at
Columbus for the years since 1878.

Properties of the Geometric Mean. Unlike the arithmetic
mean the geometric mean is most powerfully affected by the
smaller deviations because a small factor in a product has a
proportionately greater influence on the result of the
multiplication than does a larger factor.

Each property of the arithmetic mean has a corresponding
property for the geometric mean because the logarithm of the
geometric mean is the arithmetic mean of the logarithms of the
deviations. From this logarithmic correspondence all the
properties of the geometric mean can be derived* from those of the
arithmetic mean. It is apparent, for instance, that the geometric
mean applies to a series of deviations multiplied together in a way
exactly parallel to that of the arithmetic mean and a series of
terms to be added. Other parallels are: a chain of relative prices
and a series of price increases; interpolation on the assumption
of a uniform rate and of a uniform increase; compound interest
and simple interest.

The Median. Let the years 1879 to 1916 inclusive be
arranged in order of the March precipitation beginning with the
lowest. We then have with the data† measured to hundredths
of an inch:


1910 .... 0.28
1885 .... 0.53
1889 .... 0.66
1915 .... 1.19
1895 .... 1.23
1894 .... 1.79
1901 .... 1.82
1905 .... 1.87
1893 .... 1.92
1892 .... 2.23
1911 .... 2.36
1880 .... 2.42
1914 .... 2.46
1887 .... 2.56
1900 .... 2.59
1902 .... 2.63
1909 .... 2.68
1896 .... 3.04
1883 .... 3.20
1884 .... 3.59
1879 .... 3.77
1888 .... 3.79
1886 .... 3.90
1881 .... 4.01
1903 .... 4.13
1912 .... 4.56
1906 .... 4.59
1891 .... 4.64
1899 .... 4.69
1882 .... 4.76
1904 .... 4.93
1907 .... 5.21
1897 .... 5.45
1890 .... 5.63
1908 .... 6.03
1898 .... 7.03
1913 .... 8.09

*  See Zizek, "Statistical Averages," Chapter III. Also Jevons, "On
the Variation of Prices and the Value of the Currency since 1782,"
Jour. Roy. Stat. Soc., Vol. XXVIII, 1865. Galton, "The Geometric Mean
in Vital and Social Statistics," Proc. Roy. Soc., Vol. XXIX, 1897, p. 305.
McAlister, "The Law of the Geometrical Mean," the same, p. 367. Yule,
"An Introduction to Statistics," p. 123.

†  U. S. Weather Bureau Report, Columbus Station, 1917.


The middle year, 1883, in this ordered arrangement is
called the median year with respect to March precipitation;
the median precipitation of 3.20 inches being that of the median
year.

In general the median individual is defined as the
individual so located that there are as many individuals with a
greater value of the characteristic as with a less value; and the
middle value of the measured characteristic is spoken of as
the median value of the characteristic.

If the number of variates is even the median is assumed to
lie between the two middlemost variates.

It is obvious that the above median precipitation year might
have been obtained by a simple process of counting and inspection
of the data without the somewhat laborious process of
arranging the variates in order.
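
The location of the median year may be sketched in code as follows.
The dictionary holds the March precipitation values of the ordered
list above, with the year labels read as 1880, 1884 and 1890 where the
printed figures are indistinct; that reading, like the variable names,
is an assumption made for illustration.

```python
import statistics

# March precipitation at Columbus (inches), from the ordered list in the text.
march = {
    1910: 0.28, 1885: 0.53, 1889: 0.66, 1915: 1.19, 1895: 1.23, 1894: 1.79,
    1901: 1.82, 1905: 1.87, 1893: 1.92, 1892: 2.23, 1911: 2.36, 1880: 2.42,
    1914: 2.46, 1887: 2.56, 1900: 2.59, 1902: 2.63, 1909: 2.68, 1896: 3.04,
    1883: 3.20, 1884: 3.59, 1879: 3.77, 1888: 3.79, 1886: 3.90, 1881: 4.01,
    1903: 4.13, 1912: 4.56, 1906: 4.59, 1891: 4.64, 1899: 4.69, 1882: 4.76,
    1904: 4.93, 1907: 5.21, 1897: 5.45, 1890: 5.63, 1908: 6.03, 1898: 7.03,
    1913: 8.09,
}

# Arrange the years in order of precipitation and take the middle one.
ordered = sorted(march.items(), key=lambda item: item[1])
median_year, median_value = ordered[len(ordered) // 2]

print(len(ordered), median_year, median_value)
print(statistics.median(march.values()))
```

With an odd number of entries the middle position is unambiguous;
with an even number the library function returns the value midway
between the two middlemost variates.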

Exercises. 

16.  From the data of Exercise 2, Chapter III, determine the
median Columbus monthly temperature, and the median year in respect
to temperature.

17.  From  the  price  data  of  page  25  determine  the  median  top  beef 
cattle  price. 

18.  From  the  data  of  Exercise  19,  Chapter  I,  determine  the  median 
price  for  wheat. 

When the data is in the form of a frequency distribution the
computation of the position of the median is much facilitated.
All that is necessary then is to start from one extremity of the
distribution and include successive classes until half the total
frequency is obtained. The only point of difficulty in this case
is when the median is located within a class. Then it is necessary
to interpolate within the median class for the more exact position
of the median. To illustrate the method of interpolation let us
find the median student height from the data at the beginning of
Chapter III. Half of the number of variates is 375. Counting
from the lower extremity we find, up to and including class 67, a
frequency of 317, so that it is necessary to take 58 individuals
from class 68. Hence we may assume that the position of the
median will be 58/126 of a unit from the left boundary of class
68. Since this boundary is at 67.5 the median is located at
67.96 inches.
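
The interpolation just described may be written out as a short
routine. This is a sketch only: the function name
median_from_frequency_table is hypothetical, and it assumes, as in the
text, classes one inch wide whose boundaries lie half a unit on either
side of the class value.

```python
def median_from_frequency_table(classes, freqs, width=1.0):
    """Count classes from the lower extremity until half the total
    frequency is reached, then interpolate within that class."""
    half = sum(freqs) / 2
    cumulative = 0
    for value, f in zip(classes, freqs):
        if f > 0 and cumulative + f >= half:
            left_boundary = value - width / 2
            needed = half - cumulative  # individuals to take from this class
            return left_boundary + width * needed / f
        cumulative += f
    raise ValueError("the frequency table is empty")

# Classes and frequencies of the student-height distribution.
classes = list(range(61, 75))
frequencies = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]

median_height = median_from_frequency_table(classes, frequencies)
print(round(median_height, 2))
```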

Geometrically, the median deviation locates the ordinate
which divides the area under the frequency curve into two equal
parts.

Exercises. 

21.  What is the median point of population as determined by the
Bureau of the Census (see pp. 50-52, Vol. I, Report of the 13th Census)?

22.  Distinguish  the  median  point  of  population  from  the  center  of 
population. 

Quartiles.  Each  half  of  the  distribution,  one  on  either 
side  of  the  median,  may  be  divided  into  two  equal  parts.  These 
two  points  of  division  are  the  First  and  Third  Quartiles. 

The  two  quartiles  and  the  median  thus  divide  the  variates 
into  four  classes  of  equal  frequencies. 

In data having predominantly large frequencies near the
center of the distribution the quartiles are relatively close to the
median, and in widely scattered data the quartiles are relatively
far from the median. This property of the quartiles is developed
and applied in the next chapter.

The median can be found directly from the cumulative curve
by drawing a horizontal line through the point on the vertical scale
corresponding to half the total frequency. The abscissa of the
point of crossing of this horizontal line and the curve is the
median deviation.

Exercises. 

19.  By drawing the cumulative curve locate the median student
height.

20.  From  the   frequency  distribution  of  top  beef  cattle  prices  of 
page  25  determine  the  median  price  by  using  the  cumulative  curve. 

Deciles.  The  decile  variates  are  the  variates  which 
separate  the  frequency  into  ten  equal  classes.  The  median  is  of 
course  the  fifth  decile  but  the  quartiles  are  not  deciles.  The  chief 
use  of  the  deciles,  like  that  of  the  quartiles,  is  in  determining  the 
shape  of  the  distribution. 
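
The median, quartiles and deciles may all be read from the cumulative
curve in the same way. The sketch below represents that curve by
straight segments joining the cumulative frequencies at the class
boundaries of the student-height distribution; the function name
read_cumulative_curve and the straight-line form of the curve are
assumptions made for illustration.

```python
from bisect import bisect_left

# Classes and frequencies of the student-height distribution.
classes = list(range(61, 75))
frequencies = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]

# Class boundaries and the cumulative frequency reached at each boundary.
boundaries = [classes[0] - 0.5] + [c + 0.5 for c in classes]
cumulative = [0]
for f in frequencies:
    cumulative.append(cumulative[-1] + f)

def read_cumulative_curve(level):
    """Abscissa at which the cumulative curve crosses the horizontal
    line drawn at the given frequency level."""
    i = bisect_left(cumulative, level)
    if i == 0:
        return boundaries[0]
    x0, x1 = boundaries[i - 1], boundaries[i]
    y0, y1 = cumulative[i - 1], cumulative[i]
    return x0 + (x1 - x0) * (level - y0) / (y1 - y0)

total = cumulative[-1]
quartiles = [read_cumulative_curve(total * k / 4) for k in (1, 2, 3)]
deciles = [read_cumulative_curve(total * k / 10) for k in range(1, 10)]

print([round(q, 2) for q in quartiles])
print([round(d, 2) for d in deciles])
```

The second quartile and the fifth decile coincide with the median,
while the first and third quartiles fall between deciles, in
agreement with the remark above.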

Exercises. 

23.  Determine the quartile precipitations from the data of
Exercise 5, Chapter I.

24.  Determine  the  decile  precipitations  from  the  data  of  Exercise  3, 
Chapter  II. 

25.  Determine  the  quartile  and  the   decile  temperatures   from  the 
data  of  Exercise  2,  Chapter  III. 


26.  Determine the quartile prices from the top beef cattle prices
of page 25.

27.  Determine  the  quartile  top  beef  cattle  prices  from  the  data  in 
the  form  of  a  frequency  distribution  of  the  data  of  page  25. 

In  this  problem  the  quartile  prices  must  be  obtained  by  a  process  of 
interpolation  similar  to  that  described  for  the  median. 

Statistical Properties of the Median. The value of the
median ordinate depends not on the actual values of the variates
but solely on the relative values. The data need be given with
only enough exactness to permit the arrangement of the variates
in order with respect to the attribute considered. Moreover, it
is only the arrangement near the median value that must be
carefully attended to; consequently the median can not give detailed
information of the variates at the extremities of the ranges.

There is apparently no a priori reason why the value of the
median should not show considerable variation from sample to
sample taken from the same material, but in practice it is found
that the median shows as high a degree of stability as the
arithmetic mean, if not a higher one. Thus if a second group of 750
students  were  measured  as  to  height  and  the  median  computed 
it  would  most  likely  be  found  to  differ  only  slightly  from  that  of 
the  group  already  discussed.  This  slowness  of  change  in  the 
median  means  that  the  median  is  not  greatly  affected  by  the 
presence  of  accidental  and  irrelevant  influences.  That  is,  dif- 
ferences in  the  value  of  the  median  are  not  likely  to  be  merely 
accidental  and  hence  the  median  measures  significant  properties  of 
the  material.  For  instance,  a  distribution  of  wages  showing  a 
higher  median  wage  must  be  significantly  a  group  of  higher 
wages. 

The properties just discussed, together with the fact that the
median can be located by the simple process of counting, render
the median a highly important average in practical statistical
work.

The  Probable  Deviation.  The  median  variate  divides 
the  data  into  two  classes  of  equal  frequencies.  Hence  it  is  an  even 
chance  that  an  individual  selected  at  random  will  fall  into  a  desig- 
nated one  of  the  two  classes.  If  the  median  height  of  freshmen 
students is 68 inches it is an even bet that a student concerning
whose  height  nothing  is  known  has  a  height  less  than  68  inches. 

Likewise it is an even bet that a student selected at
random will have a height between the first and third quartiles.
The range from the median to the third or first quartile,
one-half of the range within which the chances are even for
an individual measurement to lie, is called the probable
deviation.*

Exercises. 

28.  Determine  the  probable  deviation  for  top  beef  cattle  prices. 

29.  Determine  the  probable  deviation  for  monthly  precipitation  at 
Columbus;  for  monthly  temperatures  at  the  same  station. 

30.  Show  that  the  probable  deviation  is  necessarily  connected  with 
the  frequency  distribution  and  not  with  a  chronological  distribution. 

The  Mode.  Notice  that,  in  the  frequency  distribution  of 
student heights, class 68 has the greatest frequency and that the
high  point  on  the  frequency  curve  is  within  the  same  class. 
The  class  of  greatest  frequency  is  called  the  modal  class  and  the 
deviation  with  the  highest  ordinate  the  modal  deviation.  A 
mode  is  thus  defined  as  a  class  or  deviation  of  greatest  fre- 
quency; more  accurately,  it  is  the  class  or  deviation  of  greater 
frequency  than  that  of  either  the  class  immediately  greater  or 
immediately  less.  This  second  definition  allows  for  distributions 
having  more  than  one  mode. 
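The second definition lends itself to a short sketch in Python. The function name `modal_classes` is hypothetical, the heights are assumed to be classes 61 to 74 inches, and the two-humped frequencies are invented purely for illustration.

```python
def modal_classes(values, freqs):
    """Return every class whose frequency exceeds that of both neighbours."""
    modes = []
    for i, freq in enumerate(freqs):
        # An end class has only one neighbour; the missing one never blocks it.
        left = freqs[i - 1] if i > 0 else float("-inf")
        right = freqs[i + 1] if i < len(freqs) - 1 else float("-inf")
        if freq > left and freq > right:
            modes.append(values[i])
    return modes


# Student heights (classes assumed to be 61 to 74 inches): a single mode.
heights = list(range(61, 75))
height_freqs = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]
height_modes = modal_classes(heights, height_freqs)
print(height_modes)  # [68]

# Hypothetical frequencies with two humps, to show more than one mode.
classes = [1, 2, 3, 4, 5, 6, 7]
two_humped = [3, 9, 5, 2, 6, 11, 4]
two_humped_modes = modal_classes(classes, two_humped)
print(two_humped_modes)  # [2, 6]
```

Applied to unsmoothed data this rule reports every small irregularity as a mode, which is why smoothing is usually done first.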

Exercises. 

31.  From   the   smoothed    frequency  curve   of   the   data   of   page  27 
determine  the  modal  monthly  precipitation. 

32.  Determine  the  modal  March  temperature  for  Columbus. 

It  is  possible  to  locate  the  mode  within  a  class  by  a  process 
of  interpolation  similar  to  that  described  in  the  determination  of 
the  median  but  by  far  the  easiest  method  is  to  construct  the 
smooth  frequency  curve  and  determine  the  abscissa  or  deviation 
of  the  greatest  ordinate. 

When  the  data  seems  to  have  more  than  one  mode  care 
must  be  exercised  in  deciding  whether  to  smooth  out  the 
apparent  modes.  In  the  frequency  distribution  of  monthly 
temperatures  it  is  evident  that  there  are  summer  and  winter 
modal  temperatures.  The  telephone-calls  data  of  Exercise  33 
below  shows  more  than  one  mode.  On  the  other  hand  the  data 
of  age  distribution  reported  by  the  United  States  Census  Bureau 


*  Certain  qualifications  of  this  definition  are  discussed  in  Chapter  V. 



shows  a  tendency  for  the  frequencies  at  the  even  ages  to  be 
larger  than  at  the  odd  ages.  This  latter  tendency  is  partly  due  to 
the  fact  that  persons  who  are  uncertain  as  to  their  exact  age  seem 
to  show  a  preference  for  an  even  number.  These  apparent  modes 
should  be  smoothed  out.  Data  with  essentially  one  mode  is  said 
to  be  unimodal;  with  more  than  one  mode,  multimodal. 

Exercises. 

33.  Smooth  the  following  data  of  the  telephone  calls  for  one  day 
at  a  business  exchange*  and  locate  the  modes. 

Time    ....     6-7        7-8        8-9      9-10    10-11     11-12      12-1        1-2        2-3 
Calls     ....    1595     3430     6389     6904     7282      7358      6361      5659      6186

Time    ....     3-4        4-5        5-6        6-7        7-8        8-9      9-10     10-11     11-12 
Calls     ....    6597     6510     6093      4508      4210     2289     1197      916       314 

Time    ....    11-12 
Calls    ....       12 

34.  Do  the  same  for  the  following  residence  calls.** 

Time    ....     6-7        7-8        8-9      9-10     10-11     11-12      12-1         1-2        2-3 
Calls    ....    1256     3796     6604     4098     4240      3816      5852      4421      3136 

Time    ....     3-4        4-5        5-6        6-7        7-8        8-9      9-10     10-11     11-12 
Calls    ....    4344     3267     4541      4778      4039     2088      1176      655       187

35.  Determine  the  modal  classes  for  the  top  beef  cattle  prices. 

Statistical  Properties  of  the  Mode.  Because  the  neces- 
sary modifications  are  easily  made  for  multimodal  data  the  prop- 
erties of  the  mode  are  here  discussed  only  for  a  unimodal  dis- 
tribution. 

Since  the  modal  class  or  deviation  is  that  of  greatest  fre- 
quency ;  that  is,  since  more  variates  belong  to  that  class  than  to 
any  other,  the  mode  is  the  most  typical  of  all  the  variates  of 
a distribution. If any one variate is to be selected as descriptive
of the data the modal variate should be that variate.
The  mode  is  accordingly  said  to  define  the  type  of  the  dis- 
tribution. The  significance  of  the  mode  as  a  type  depends,  of 


*  By permission of Central Union Telephone Company, Columbus,
Main Exchange.

**  Same,  North  Exchange. 



course,  on  the  relative  preponderance  of  its  frequency.  Thus 
the  frequency  of  height  68  in  the  case  of  the  student  dis- 
tribution of  page  24  is  126  and  the  combined  frequency  of  the 
classes  near  the  modal  class  is  a  large  percent  of  the  total  fre- 
quency. In  the  beef  cattle  prices  of  page  25  the  modal  class 
has  a  frequency  of  43  and  there  is  not  as  rapid  falling  off  in  the 
frequency  on  either  side  of  this  class  as  is  shown  by  the 
height  data.  Hence  in  the  price  data  the  mode  does  not  have  as 
great  significance  as  it  does  in  the  height  data.  Data  show- 
ing a  strong  tendency  to  concentrate  about  the  mode  is  said 
to  be  highly  stable  or  true  to  type.  Measures  of  trueness 
to  type  are  discussed  in  the  following  chapter. 

The  position  of  the  mode  depends  only  on  the  values  of  a 
few  variates  so  that  the  mode  like  the  median  gives  little  infor- 
mation of  the  extremes  of  the  range. 

The  mode  cannot  be  accurately  determined  by  a  simple 
process  of  arithmetic  as  can  the  median  and  the  mean. 

The  mode  being  the  predominating  value,  the  type,  the  fash- 
ion, it  is  what  is  ordinarily  in  the  popular  mind  when  an  average 
is  spoken  of.  The  statement  that  the  average  person  spends  one- 
third  of  his  income  in  rent  is  most  likely  to  mean  that  more  per- 
sons spend  about  that  per  cent  than  any  other  per  cent. 

Exercises. 

36.  Determine  the  modal  class  for  each  frequency  distribution  of 
Chapter  III. 

37.  Show  that  the  concept  of  mode  does  not  apply  to  a  curve  of  the 
historical  type. 


CHAPTER  V. 
THE  FORM  OF  A  DISTRIBUTION. 

Dispersion.  It  is  stated  in  the  preceding  chapter  that 
the  significance  of  the  mode  as  a  representative  of  the  data  de- 
pends on  the  extent  to  which  the  data  conforms  to  the  mode  as  a 
type.  That  is,  if  the  sum  of  the  frequencies  near  the  mode  is  a 
relatively  large  per  cent  of  the  total  frequency  the  modal  devia- 
tion is  highly  typical  and  the  data  is  not  highly  variable.  The 
word  variable  is  used  because,  if  in  the  data  a  certain  type  does 
not  predominate,  different  samples  will  have  a  tendency  to  show 
widely  differing  distributions.  If,  to  illustrate,  the  modal  fre- 
quency of  a  second  distribution  of  the  heights  of  750  students  is 
only  95  with  a  similar  reduction  in  the  other  larger  frequencies, 
this  second  distribution  is  not  so  true  to  the  type  expressed  by  the 
mode  as  is  the  first  distribution. 

To  repeat,  a  distribution  with  small  frequencies  at  the 
ends  of  the  ranges  and  with  the  frequencies  concentrated  at 
a  point  is  said  to  be  true  to  type,  to  be  highly  stable.  Let  us 
investigate  various  methods  of  measuring  the  extent  to  which 
the  data  is  scattered  or  dispersed  about  the  class  of  concen- 
tration. 

Measures  of  Dispersion.  Because  the  breadth  of  the 
range  depends  on  the  usually  uncertain  data  at  the  extremes  it 
does  not  furnish  a  reliable  measure  of  the  extent  to  which  the 
data  is  spread-out.  As  given  on  page  24  the  range  of  student 
heights  is  14  inches ;  the  inclusion  of  a  single  student  of  height  58 
inches  would  increase  the  range  by  more  than  twenty  percent. 

We  have  seen  that  in  theory  the  dispersion  should  be  meas- 
ured from  the  mode  but  in  practical  statistical  work  the  mean, 
median  and  mode  differ  so  little  in  position  that  it  is  ordinarily 
permissible to measure the dispersion from the mean.

The  sum  of  the  deviations  about  the  mean  is  useless  as  a 
measure  of  dispersion  because,  as  was  proved  on  page  35,  this 
sum  is  zero  regardless  of  the  spread  or  dispersion  of  the  dis- 
tribution. 

Mean Deviation. Since the object in measuring dispersion
is to determine the divergences of the variates from an
average  it  is  the  amount  of  a  divergence  that  counts  and  not 
its  direction.  Hence  a  logical  measure  of  dispersion  is  obtained 
by  adding  the  divergences,  all  counted  positive,  and  then  divid- 
ing the  sum  by  the  total  frequency.  This  gives  the  mean 
deviation. 

The  form  for  the  computation  of  the  mean  deviation  is  the 
same  as  for  the  arithmetic  mean  except  that  all  deviations  are 
measured  from  the  mean,  median  or  mode,  whichever  is  chosen 
for  the  origin,  and  all  negative  signs  are  disregarded. 

Exercises.

1.  Compute  the  mean  deviation  from  the  arithmetic  mean  of  the 
Student  Height  Data  of  page  24. 

Referring  to  the  computation  for  the  arithmetic  mean  on  page  88, 
let  us  add  a  column  obtained  by  taking  the  difference  between  the  mean 
and  each  deviation  and  then  multiply  these  differences  by  the  respective 
frequencies  and  add  the  resulting  products.  This  sum  is  then  divided 
by  the  total  frequency  in  order  to  obtain  the  mean  deviation.  We  thus 
have: 

Computation of the Mean Deviation.

Class No.     Diff.     Freq.       Prod.
    1          6.9         2         13.8
    2          5.9        10         59.0
    3          4.9        11         53.9
    4          3.9        38        148.2
    5          2.9        57        165.3
    6          1.9        93        176.7
    7          0.9       106         95.4
    8          0.1       126         12.6
    9          1.1       109        119.9
   10          2.1        87        182.7
   11          3.1        75        232.5
   12          4.1        23         94.3
   13          5.1         9         45.9
   14          6.1         4         24.4
                         ---      -------
                         750 )    1,424.6
                                      1.9

Mean deviation = 1.9 classes.

Since each class interval is one inch the mean deviation is 1.9 inches.

TABLE III.
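The computation of the mean deviation can be sketched in Python as a check on the arithmetic. This is an illustrative sketch only: the function names are hypothetical and the classes are assumed to be heights of 61 to 74 inches, class 8 being 68 inches.

```python
# Student height frequencies; classes assumed to be 61 to 74 inches.
heights = list(range(61, 75))
freqs = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]


def weighted_mean(values, freqs):
    """Arithmetic mean of a frequency distribution."""
    return sum(v * f for v, f in zip(values, freqs)) / sum(freqs)


def mean_deviation(values, freqs, center):
    """Mean of the deviations from `center`, all counted positive."""
    total = sum(f * abs(v - center) for v, f in zip(values, freqs))
    return total / sum(freqs)


mean_height = weighted_mean(heights, freqs)
md = mean_deviation(heights, freqs, mean_height)
print("mean height:   ", round(mean_height, 1))
print("mean deviation:", round(md, 1))
```

Passing the median or the mode as `center` gives the mean deviation about that average instead.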



2.  Compute the mean deviation about the arithmetic mean of the
price data of page 25.

3.  Compute  the  mean  deviation  about  the  median  of  the  price  data 
of page 25 and compare the result with that of Exercise 2.

4.  Compute  the  mean  deviation  about  the  arithmetic  mean  of  the 
precipitation data of Exercise 5, Chapter I, and of the temperature data
of  Exercise  6  of  the  same  chapter. 

5.  From  the  frequency  tables  of  Exercises  2  and  3  of  Chapter  III 
compute   the   mean    deviation   of   monthly   precipitation    and    of   monthly 
temperature. 

For  purposes  of  comparing  the  stability  of  different  distribu- 
tions it  is  desirable  to  divide  the  mean  deviation  by  the  mean  or 
median,  whichever  is  used.  When  this  is  done  the  mean  deviation 
is  expressed  as  a  fraction  of  the  base  average.  For  instance,  it 
seems  reasonable  to  say  that  a  mean  deviation  of  0.3  with  an 
arithmetic  mean  of  20  has  the  same  significance  as  a  mean  devia- 
tion of  0.9  based  on  an  arithmetic  mean  of  60. 
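As a worked check of this example, dividing each mean deviation by its arithmetic mean gives the same fraction in both cases:

```latex
\[
\frac{0.3}{20} = 0.015 = \frac{0.9}{60}
\]
```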

Exercises. 

5.     Compare  the  dispersions  in  Exercises  1,  2,  3,  4. 

Because,  as  is  presently  proved,  the  mean  deviation  is  least 
when  taken  about  the  median  it  is  theoretically  best  to  compute 
the  mean  deviation  about  that  average.  When  so  done  there  is  a 
certain  degree  of  standardization  which  is  not  attained  with  any 
other  average  as  a  base,  but  the  point  is  not  of  great  practical  im- 
portance unless  the  median  and  the  arithmetic  mean  differ 
markedly. 

Proof that the mean deviation is smallest when taken
about the median.

Let  P  be  a  point  on  the  line  S-T  between  the  points  A  and  B. 
The  sum  of  the  deviations  of  P  from  A  and  B  is,  without  regard  to 
the  sign  of  the  negative  deviation  PA,  PB  +  PA,  and  this  sum  is  equal 
to  AB.  If  P  should  lie  without  the  segment  AB  the  sum  of  the  two 
deviations  would  be  greater  than  AB.  Likewise  the  sum  of  the  distances 
of  P  from  any  other  two  points  C  and  D  is  least  when  P  lies  between 
them.  Hence  the  total  sum  of  deviations  of  P  from  any  number  of 
points  is  least  when  there  are  as  many  points  on  one  side  of  P  as  on 
the  other ;  that  is,  when  P  is  the  median  of  the  points. 

[Figure: the line S-T with the points A, C, E, P, B, D, F marked along it.]
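The result of this proof can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the proof: every student is treated as standing exactly at his class value, the classes are assumed to be 61 to 74 inches, and the function name is hypothetical.

```python
# Student height frequencies; classes assumed to be 61 to 74 inches.
heights = list(range(61, 75))
freqs = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]


def total_abs_deviation(values, freqs, point):
    """Sum of the deviations from `point`, all counted positive."""
    return sum(f * abs(v - point) for v, f in zip(values, freqs))


# Try every class value as the point P and keep the one with the least sum.
totals = {p: total_abs_deviation(heights, freqs, p) for p in heights}
best = min(totals, key=totals.get)
print("least sum of deviations is about class", best)

# The arithmetic mean, 67.9, gives a larger sum than the median class.
about_mean = total_abs_deviation(heights, freqs, 67.9)
print(totals[best], "<", round(about_mean, 1))
```

Class 68 is the median class here, since the cumulative frequency first passes one-half of 750 within it.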



Exercises. 

6.  According  to  the  measure  supplied  by  the  mean  deviation  which 
is    the    more   variable,    the    monthly    mean    temperature    or   the    monthly 
mean  precipitation  at  Columbus? 

7.  From  the  data  of  heights  on  page  24  and  the  data  of  weights 
of    Exercise    13,    Chapter    III,    determine    which    is    the    more    variable, 
student  height  or  student  weight. 

Statistical  Properties  of  the  Mean  Deviation.  The  mean 
deviation  as  a  measure  of  dispersion  has  all  the  properties  of  a 
mean  —  it  takes  all  the  variates  into  account ;  it  takes  each  variate 
according  to  its  size  and  consequently  may  give  more  prominence 
to  extreme  variates  than  their  statistical  importance  may  warrant ; 
it  is  computed  by  a  simple  process  of  arithmetic.  Because  in 
forming  it  only  the  numerical  values  of  the  deviations  are  used 
and  all  distinctions  between  positive  and  negative  deviations  are 
disregarded  the  mean  deviation  is  not  well  adapted  to  certain 
statistical  purposes  for  which  the  standard  deviation,  to  be  next 
discussed,  is  preeminently  fitted. 

Altogether  the  mean  deviation  is  an  index  of  dispersion  of 
practical  importance  and  should  ordinarily  be  used  either  alone 
or  in  connection  with  other  measures. 

The  Standard  Deviation.  The  mathematically  simplest 
device  for  eliminating  negative  signs  is  by  squaring  the  terms. 
Hence  if  the  difference  between  each  deviation  and  the  mean 
be  squared,  the  sum  of  the  squares  added  and  the  resulting 
sum  divided  by  the  total  frequency  the  mean  squared  devia- 
tion thus  obtained,  is  a  measure  of  dispersion  which  is  arith- 
metically more  convenient  than  is  the  mean  deviation. 

The  computation  of  the  mean  squared  deviation  differs  from 
the  computation  of  the  mean  deviation,  which  is  illustrated  under 
Exercise 1, only in that the deviation differences are squared before
multiplication by the frequencies. It is of course possible
to  compute  directly  from  the  data  without  using  the  frequency 
table  but  only  a  slight  error  is  introduced  by  the  combining  of 
the  actual  values  into  reasonably  narrow  classes  and  much  labor 
is ordinarily saved because only one multiplication is then required
for each class instead of for each individual variate as is
necessary  if  the  frequency  distribution  is  not  used. 

Exercises. 

6.  Determine    the    mean    squared    deviation    about    the    arithmetic 
mean  of  the  data  of  Student  Heights. 

7.  Do  the  same  for  the  Prices  of  Top  Beef  Cattle. 

8.  Do the same for Monthly Precipitation at the Columbus Station.

9.  Do the same for Monthly Temperatures at the Columbus Station.

The  above  method  of  computing  the  mean  squared  deviation 
involves  fractional  differences  in  the  deviations.  By  the  follow- 
ing modification  fractions  can  be  avoided. 

Short  Rule  for  the  Mean  Squared  Deviation.  Select  an 
integral  deviation  near  the  actual  arithmetic  mean  and  find  the 
difference  between  each  deviation  and  this  selected  deviation. 
Square each of the differences so obtained, multiply by the corresponding
frequency, add, and divide by the total frequency.
The  result  is  the  mean  squared  deviation  from  the  selected  value. 
To  obtain  the  mean  squared  deviation  from  the  arithmetic  mean 
all  that  is  necessary  is  to  subtract  from  the  value  just  com- 
puted the  square  of  the  difference  between  the  true  arithmetic 
mean and the selected integral value. If the mean squared
deviation about the actual arithmetic mean is denoted by
$\sigma^2$, the square of the Greek letter $\sigma$ (sigma), and the mean squared
deviation about any other point by the same symbol written with a prime,
$\sigma'^2$, we have, on recalling that the letter d is used to denote the
deviation of the arithmetic mean from the origin, the following
formula:

$$\sigma^2 = \sigma'^2 - d^2$$

To  prove  this  formula  let  the  deviations  from  the  original 
origin  be  denoted  by  X  and  the  deviations  from  the  arithmetic 
mean by x and let the distance of the mean from the original
origin  be  denoted  by  d.  Then  X  =  x  +  d  for  each  individual  in 
the  distribution  and  x  =  X  —  d. 

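The mean squared deviation about the mean is obtained by squaring each x,
adding, and dividing by the total frequency N. Performing these operations
we have

$$\sigma^2 = \frac{x_1^2 + x_2^2 + \cdots}{N} = \frac{(X_1 - d)^2 + (X_2 - d)^2 + \cdots}{N}$$

$$= \frac{(X_1^2 + X_2^2 + \cdots) - 2d\,(X_1 + X_2 + \cdots) + N d^2}{N}$$

$$= \frac{N\sigma'^2 - 2N d^2 + N d^2}{N} = \sigma'^2 - d^2,$$

since $X_1 + X_2 + \cdots = (x_1 + x_2 + \cdots) + Nd = Nd$, the sum
$x_1 + x_2 + \cdots$ being zero by the Theorem of page 35.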

Exercises. 

10.     Recompute  by  the   shortened   method   the   mean    squared   devia- 
tion about  the  arithmetic  mean  of  the  Student  Heights. 

Let us take the assumed origin at class 68. The mean
squared deviation is then obtained by the following computation:

Class.     Dev.     Dev. squared.     Freq.     Prod.
   1        7            49              2         98
   2        6            36             10        360
   3        5            25             11        275
   4        4            16             38        608
   5        3             9             57        513
   6        2             4             93        372
   7        1             1            106        106
   8        0             0            126          0
   9        1             1            109        109
  10        2             4             87        348
  11        3             9             75        675
  12        4            16             23        368
  13        5            25              9        225
  14        6            36              4        144
                                       ---      -----
                                       750 )     4201
                                                 5.60

d = 68 - 67.9 = 0.1;    d^2 = 0.01.
Therefore $\sigma^2$ = 5.60 - 0.01 = 5.59
and $\sigma$ = 2.36.

TABLE IV.
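The short rule can be verified against the direct computation with a few lines of Python. This is a sketch under the assumption that the classes are heights of 61 to 74 inches; the variable names are illustrative only.

```python
import math

# Student height frequencies; classes assumed to be 61 to 74 inches.
heights = list(range(61, 75))
freqs = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]
N = sum(freqs)

# Short rule: mean squared deviation about the assumed origin, class 68.
origin = 68
msd_about_origin = sum(f * (v - origin) ** 2 for v, f in zip(heights, freqs)) / N

# Distance d between the true arithmetic mean and the assumed origin.
mean = sum(v * f for v, f in zip(heights, freqs)) / N
d = mean - origin
msd_short = msd_about_origin - d ** 2

# Direct computation about the arithmetic mean, fractions and all.
msd_direct = sum(f * (v - mean) ** 2 for v, f in zip(heights, freqs)) / N

sigma = math.sqrt(msd_short)
print(round(msd_about_origin, 2), round(msd_short, 2), round(sigma, 2))
print("short rule agrees with direct:", math.isclose(msd_short, msd_direct))
```

The sign of d does not matter, since only its square enters the formula.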

11.  Using the shortened method compute the mean squared deviation for
the Monthly Precipitation data.



The  Standard  Deviation.  The  square  root  of  the  mean 
squared  deviation  about  the  arithmetic  mean  is  called  the 
Standard  Deviation. 

From the formula $\sigma^2 = \sigma'^2 - d^2$ it is seen that any other
mean  squared  deviation  always  exceeds  the  square  of  the  stand- 
ard deviation  by  the  square  of  the  difference  between  the  as- 
sumed origin  and  the  true  arithmetic  mean. 

This  gives  a  certain  practical  and  theoretical  preference  to 
the  standard  deviation  over  that  of  any  other  mean  squared 
deviation.  For  this  reason,  and  because  certain  other  computa- 
tions are  rendered  simpler  by  so  doing,  the  mean  squared  devia- 
tion about  any  other  value  than  the  arithmetic  mean  is  seldom 
computed  even  tho  the  idea  of  trueness  of  type  centers  about  the 
mode.  Since  the  mean  and  the  mode  rarely  differ  by  more  than 
a  small  amount  the  square  of  this  difference  will  be  relatively 
still  smaller  and  as  a  result,  the  difference  between  the  square  of 
the  standard  deviation  and  the  mean  squared  deviation  about  the 
mode  is  ordinarily  negligible. 

Properties  of  the  Standard  Deviation.  Since  a  small 
value  for  the  standard  deviation  can  arise  only  when  the  variates 
are  closely  concentrated  about  the  mean  or  mode  and  since  a 
large  value  must  be  due  to  a  relatively  high  frequency  of  the 
variates  near  the  extremes  of  the  distribution,  the  standard  devia- 
tion is  a  measure  of  the  dispersion  of  the  data.  Because  the 
effect  of  squaring  is  to  diminish  the  importance  of  the  smaller 
values  and  to  exaggerate  the  importance  of  the  larger  values  a 
small  value  for  the  standard  deviation  shows  conclusively  that  the 
data  is  highly  true  to  type  and  stable,  while  on  the  other  hand 
a  large  value  may  to  some  extent  be  due  to  the  presence  of  the 
larger  frequencies  of  the  extreme  variates  and  hence  not  alto- 
gether significant.  But  even  with  this  qualification  the  standard 
deviation  is  a  thoroly  practicable  and  reliable  index  of  the  dis- 
persion of  the  data. 

Exercises. 

12.  Discuss    the    comparative    variabilities    of    the    distributions    for 
which    the    standard    deviations    have    been    computed    in    the    preceding 
exercises  of  this  chapter. 

13.  Does  a  standard  deviation  of  2.4   for  height  denote  a  smaller 
variability  than  a  standard  deviation  of  15  pounds  for  weight? 



The  Coefficient  of  Variability.  As  in  the  case  of  the 
mean  deviation  the  significance  of  a  value  for  the  standard  devi- 
ation depends  on  the  size  of  the  variates.  A  variation  of  10 
feet  in  a  measurement  of  five  miles  is  of  the  same  degree  of 
accuracy  as  a  variation  of  2  feet  in  one  mile. 

It  is,  therefore,  reasonable  to  divide  the  standard  devia- 
tion by  the  mean  in  order  to  express  it  as  a  fraction  of  the 
size  of  the  variates.  This  quotient  is  ordinarily  quite  small, 
so  that  it  is  usual  to  multiply  it  by  100.  The  resulting  co- 
efficient—  100  times  the  standard  deviation  divided  by  the 
mean  —  is  called  the  coefficient  of  variability. 
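The coefficient of variability can be sketched as a small Python function. The function name is hypothetical and the height classes are assumed to be 61 to 74 inches; the standard deviation is computed about the arithmetic mean, as defined above.

```python
import math

# Student height frequencies; classes assumed to be 61 to 74 inches.
heights = list(range(61, 75))
freqs = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]


def coefficient_of_variability(values, freqs):
    """100 times the standard deviation divided by the arithmetic mean."""
    n = sum(freqs)
    mean = sum(v * f for v, f in zip(values, freqs)) / n
    msd = sum(f * (v - mean) ** 2 for v, f in zip(values, freqs)) / n
    return 100 * math.sqrt(msd) / mean


cv_height = coefficient_of_variability(heights, freqs)
print("coefficient of variability of height:", round(cv_height, 2))
```

Because the quotient is a pure number, values for height and for weight may be compared directly, which the standard deviations in inches and pounds do not allow.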

Exercises. 

14.  Compare  the   value   of  the  coefficient   of   variability   for  height 
with  that  for  weight  as  shown  by  the  data  of  the  preceding  chapter. 

15.  Discuss  the  comparative   variabilities   of   March   and  July  tem- 
peratures as  recorded  at  the  Columbus  Station. 

16.  Apply  the   coefficient  of   variability  to  determine   which   is   the 
more    variable,    Columbus    monthly    temperature    or    Columbus    monthly 
precipitation.     Compare  the  variability  of  annual  temperatures  with  that 
of  precipitation. 

17.  Compare  the  value  of  the  coefficient  of  variability  with  that  of 
the  quotient  of  the  mean  deviation  by  the  arithmetic  mean. 

18.  Discuss  the  comparative  practical  usefulness  of  the  two  indices 
of  variability. 

The  Quartiles  as  Measures  of  Dispersion.  The  distance 
from  the  median  to  the  third  quartile  is  the  interval  that  in- 
cludes half  the  frequencies  on  the  right  of  the  median.  The 
distance  from  the  median  to  the  first  quartile  is  the  interval 
that  includes  half  the  frequencies  on  the  other  side  of  the 
median.  Now  if  these  distances  are  relatively  large  it  must 
mean  that  the  frequencies  at  the  center  are  not  large  in  compari- 
son with  the  total  frequency.  That  is,  if  the  first  and  the  third 
quartiles  are  close  together  the  distribution  must  be  closely 
concentrated  about  the  median ;  must  be  highly  typical ;  must 
show  a  low  degree  of  variability,  because  in  every  case  one-half 
the  total  frequency  is  included  between  these  two  quartiles 
and  if  the  interval  is  narrow  the  ordinates  must  be  tall,  that  is 
the  frequencies  in  the  center  must  be  predominatingly  large, 
in  order  to  include  half  the  total  frequency.  If  the  data  has 
a  flat  frequency  curve  so  that  the  degree  of  variability  is  large 



and  the  trueness  to  type  small  the  two  quartiles  will  be  com- 
paratively far  apart. 

Ordinarily  the  distance  between  the  first  quartile  and 
the  median  is  approximately  equal  to  the  distance  from  the 
median to the third quartile so that the distance from the
median to the third quartile is taken as the index of dispersion
of the distribution. This distance is called the probable
deviation.

Since  half  the  total  number  of  frequencies  are  included 
between  the  two  quartiles  the  chances  are  even  that  an  in- 
dividual of  the  group,  selected  at  random,  will  have  a  deviation 
lying  between  the  quartile  deviations.  In  other  words,  the 
chances  are  even  that  an  individual  selected  at  random  from 
the  group  will  have  a  deviation  numerically  less  than  the  prob- 
able deviation.  If  in  one  group  of  750  students,  for  instance, 
it  is  an  even  bet  that  a  student  selected  at  random  has  a  height 
between  66  and  70  inches  and  in  a  second  group  the  range  for 
even  chances  is  from  67  to  69,  the  second  group  is  said  to  be 
the  more  true  to  type. 

Formula  for  the  Probable  Deviation.  The  probable 
deviation  can  always  be  found  by  the  simple  process  of  locating 
the  quartiles.  It  is  proved  in  the  following  chapter  that  for  a 
certain  special,  tho  very  frequently  occurring,  form  of  distri- 
bution the  probable  deviation  is  equal  to  the  standard  deviation 
multiplied  by  a  constant. 

In symbols, we have P. E. = 0.6745 $\sigma$, where the symbol
P. E., inherited from the theory of errors developed by Gauss,
denotes the probable deviation.

If  the  distribution  is  markedly  unsymmetrical  the  above 
formula  may  not  hold  accurately  and  there  are  symmetrical 
distributions  for  which  it  does  not  hold  exactly.  But  extreme 
accuracy  in  the  matter  of  an  index  of  dispersion  is  not  necessary 
or  desirable ;  the  formula  is  generally  used  regardless  of  the 
form  of  the  distribution. 
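The constant in this formula can be recovered in Python on the assumption that the special form of distribution is the normal curve, for which the third quartile lies 0.6745 standard deviations above the mean. The sketch below uses `statistics.NormalDist` from the standard library (Python 3.8 or later) and applies the formula to the student heights, whose classes are assumed to be 61 to 74 inches.

```python
import math
from statistics import NormalDist

# Third quartile of the standard normal curve: the constant of the formula.
constant = NormalDist().inv_cdf(0.75)
print("constant:", round(constant, 4))

# Student height frequencies; classes assumed to be 61 to 74 inches.
heights = list(range(61, 75))
freqs = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]
n = sum(freqs)
mean = sum(v * f for v, f in zip(heights, freqs)) / n
sigma = math.sqrt(sum(f * (v - mean) ** 2 for v, f in zip(heights, freqs)) / n)

# Probable deviation by the formula P. E. = 0.6745 sigma.
probable_deviation = constant * sigma
print("probable deviation of height:", round(probable_deviation, 2), "inches")
```

A value found by locating the quartiles directly will generally differ a little from this one, since no actual distribution follows the assumed form exactly.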

Exercises. 

19.  Compute  the  probable  deviation  of  the  Student  Heights. 

20.  Compute  the  probable  deviation  of  the  Student  Weights. 

21.  Compute  the  probable  deviation  of  the  Top  Beef  Cattle  Prices. 



Probable  Deviation  of  the  Arithmetic  Mean.  The  arith- 
metic mean  in  the  Student  Height  data  of  page  24  is  67.9 
inches.  The  mean  height  of  a  second  group  of  750  students 
from  the  same  student  population  would  most  likely  not  differ 
greatly  from  67.9  but  it  is  not  at  all  likely  that  it  would  be 
exactly  the  same  as  that  of  the  first  group.  Let  group  after 
group  be  taken  and  the  value  of  the  mean  computed  for  each 
group.  The  values  of  these  means  would  themselves  form 
a  frequency  distribution  from  which  a  mean  and  probable  devia- 
tion could  be  obtained. 

Now  if  the  student  data  is  highly  typical  and  stable  the 
variation  in  the  successive  means  will  be  within  a  small  range 
and  hence  the  probable  deviation  of  the  means  will  be  relatively 
small.  Let  us  assume  the  value  which  we  have  obtained  by 
actual  observation,  namely  67.9,  is  the  best  estimate  of  the  true 
mean  of  the  height  of  all  such  students;  that  is,  that  the  devia- 
tion of  greatest  frequency  in  the  frequency  distribution  of 
means  is  67.9.  Then  the  probable  deviation  in  this  distribution 
will  be  the  probable  deviation  of  the  mean.  It  can  be  proved  that 
this  probable  deviation  is  obtained  in  accordance  with  the  formula, 

$$\text{P. E. of mean} = 0.6745 \, \frac{\sigma}{\sqrt{N}}$$


Exercises. 

22.  Compute  the  probable  deviation  of  the  mean  Monthly  Precipita- 
tion for  the  Columbus  Station. 

23.  Compute the probable deviation for the Monthly Top Beef Cattle
Prices of page 25.

Probable Deviation of the Standard Deviation. The
probable deviation of the standard deviation may be explained
by a process of reasoning similar to that for the probable deviation
of the mean. The formula for this probable deviation is:

$$\text{P. E. of standard deviation} = 0.6745 \, \frac{\sigma}{\sqrt{2N}}$$
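Both probable deviations can be sketched in Python for the student heights. The sketch assumes the usual forms of the two formulas, 0.6745 σ/√N for the mean and 0.6745 σ/√(2N) for the standard deviation, and assumes the height classes are 61 to 74 inches; the variable names are illustrative.

```python
import math

# Student height frequencies; classes assumed to be 61 to 74 inches.
heights = list(range(61, 75))
freqs = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]
n = sum(freqs)
mean = sum(v * f for v, f in zip(heights, freqs)) / n
sigma = math.sqrt(sum(f * (v - mean) ** 2 for v, f in zip(heights, freqs)) / n)

# Assumed forms of the formulas for the mean and the standard deviation.
pe_mean = 0.6745 * sigma / math.sqrt(n)
pe_sigma = 0.6745 * sigma / math.sqrt(2 * n)

print("P. E. of mean:              ", round(pe_mean, 3), "inches")
print("P. E. of standard deviation:", round(pe_sigma, 3), "inches")
```

Both values shrink as the square root of the number of individuals grows, so a group four times as large halves the probable deviation of its mean.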


Exercises. 

24.  Compute   the  probable   deviation  of   the   standard  deviation   of 
Student  Heights. 

25.  Compute  the   probable   deviation   of   the   standard   deviation   of 
Student  Weights. 



26.  Which is the more variable, the standard deviation of Student
Heights  or  of  Weights? 

Statistical Significance of the Probable Deviation. The
statistical application of the probable deviation may be illustrated
by the following questions: The mean height of a group
of  students  is  67.9  with  a  probable  deviation  of  1.78  inches. 
The  height  of  a  student  taken  at  random  from  a  second  group 
is  72  inches.  What  is  to  be  concluded  ?  That  the  two  groups 
are  taken  from  essentially  the  same  populations  or  that  they 
'all  are  taken  from  distinctly  different  populations?  That  is, 
how  many  times  may  a  deviation  exceed  the  probable  deviation 
and  still  be  assumed  to  come  from  the  same  material  ?  It  must 
be  apparent  that  this  is  a  fundamental  question  in  statistical 
analysis.  Further  discussion  of  it  is  deferred  to  the  following 
chapter. 

The  Deciles  as  Measures  of  Dispersion.  The  position 
of  the  deciles  shows  the  spread  of  the  variates  in  the  distribu- 
tion. If  the  deciles  near  the  middle  of  the  distribution  are 
close  together  and  the  deciles  near  the  beginning  and  the  end 
of  the  ranges  are  far  apart  the  distribution  is  highly  variable 
and  not  true  to  type.  Because  there  are  nine  decile  positions 
to  observe  in  a  distribution  the  decile  is  not  so  simple  a  measure 
of  dispersion  as  is  the  quartile  or  standard  deviation,  tho  this 
very  fact  of  greater  detail  may  in  some  cases  be  of  advantage. 

Exercises. 

27.  By the use of the deciles compare the variability of monthly
precipitation  at  the  Columbus  Station  with  that  of  monthly  temperatures 
at  the  same  station. 

Symmetrical  and  Asymmetrical  Distributions.  The 
curve  of  student  heights  is  essentially  of  the  same  shape  to  the 
right  of  the  highest  point  as  it  is  to  the  left.  It  is  a  symmet- 
rical curve.  (Fig.  VI.)  Statistically  the  fact  of  symmetry  means 
in  this  case  that  there  is  no  tendency  for  the  students  to  be  either 
tall  or  short;  that  there  is  no  selection  between  the  tall  and  the 
short ;  that  the  chances  for  a  tall  person  to  belong  to  the 
student  group  are  equally  as  good  as  those  of  a  short  person ; 
that  there  is  absolutely  no  connection  between  being  a  member 
of this student group and being tall or being short.



FIG.  VI.    A  Symmetrical  Curve. 

On  the  other  hand  the  curve  of  height  of  the  members 
of  a  police  force  would  have  a  longer  range  to  the  right 
than  to  the  left  because  extremely  short  persons  are  excluded. 
The  curve  in  this  case  is  said  to  be  asymmetrical.  Asymmetry 
in  a  curve  denotes  the  presence  of  selection  in  the  data;  of  a  de- 
pendence; of  an  expressed  preference  for  certain  values  of  the 
attribute. 



FIG.  VII.     An  Asymmetrical  or  Skew  Curve. 

Exercises. 

28.  Examine  each  frequency  curve  of  Chapter  III  for  symmetry  and 
discuss  the  significance  of  each  case  of  asymmetry. 

The Position of the Averages and Asymmetry.  In the
symmetrical curve the mean, median and mode coincide.  The
cutting off of the range to the left tends to move the mean
to the right because the longer deviations are to the right,
and it has been seen that the mean is most affected by the
longer or extreme deviations.  This places the median at the
left of the mean.  The mode will tend to be moved to the left of
the median, both because of the moving of the mean to the right
and because of the shortening of the left range with a consequent
heaping up of the frequencies within the left half.  The
result is that the three averages are then in the order: mode,
median, mean.  It has been verified experimentally that for
moderately asymmetrical distributions the distance of the median
from the mean is about one-third the distance of the mean
from the mode.

Skewness.  An  asymmetrical  curve  is  said  to  be  skew. 
Skewness  is  positive  when  the  longer  range  is  to  the  right 
and  negative  when  the  longer  range  is  to  the  left. 

Measures  of  Skewness.  Since  the  mode  and  mean  are 
separated  to  an  extent  depending  on  the  degree  of  skewness 
present,  a  logical  measure  of  skewness  is  the  difference  between 
the  mean  and  the  mode.  Because  a  large  difference  between  the 
positions  of  the  mean  and  the  mode  in  widely  spread-out  data 
may  not  be  so  significant  as  a  smaller  difference  in  highly  con- 
centrated data  it  is  advisable  to  divide  this  difference  by  the 
standard  deviation.  Hence  we  have, 

$$\text{Skewness} = \frac{\text{Mean} - \text{Mode}}{\sigma}$$
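
The measure of skewness described above, the difference between the mean and the mode divided by the standard deviation, can be sketched in Python as follows. The function name and the two small data sets are hypothetical, chosen only so that one has its longer range to the right and the other to the left; the mode is taken simply as the most frequent value.

```python
import statistics


def pearson_mode_skewness(values):
    """Skewness = (mean - mode) / standard deviation."""
    mean = statistics.mean(values)
    mode = statistics.mode(values)      # most frequent value
    sigma = statistics.pstdev(values)   # standard deviation of the whole set
    return (mean - mode) / sigma


# Hypothetical measurements with the longer range to the right.
right_tailed = [1, 2, 2, 2, 3, 3, 4, 5, 9]
# The mirror image has the longer range to the left.
left_tailed = [10 - v for v in right_tailed]

print(pearson_mode_skewness(right_tailed) > 0)   # longer range to the right
print(pearson_mode_skewness(left_tailed) < 0)    # longer range to the left
```

For grouped data the mode would have to be estimated from the class frequencies, which this sketch does not attempt.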


Exercises. 

29.  Compute  the  skewness  of  the  following  data  of  incomes : 

Estimated Distribution of Income among the Single Women of
Continental United States in 1890.  (King, Wealth and Income, p. 224)

Class            No. in Thousands
0-$200                  10
$200-$300               70
$300-$400              560
$400-$500              530
$500-$600              280
$600-$700              150
$700-$800              120
$800-$900               37
$900-$1000              22
$1100-$1200             12
$1200-$1300              8
$1300-$1400              5

30.  Show  that  the  above  formula  for  skewness  correctly  indicates 
the  sign  of  the  skewness. 

A Second Measure of Skewness is obtained as follows:
Any measure of skewness must take into account the distinction
between positive and negative deviations.  The total sum of
deviations from the mean is zero regardless of the form of
the distribution; the standard deviation involves the deviations
as squares and hence obliterates the distinction between positive
and negative deviations.  The mean cubed deviation, however,
will serve as a measure of skewness.  The longer deviations
to the right, if the skewness is positive, will be more powerfully
affected by the operation of cubing than will the shorter deviations
to the left and hence the total sum of cubed deviations
will be positive.  It is well to extract the cube root of the
mean cubed deviation and then, in order to express the skewness
as a fraction of the spread of the distribution, to divide the result
by the standard deviation.
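
The second measure of skewness can be sketched in the same way: the cube root of the mean cubed deviation, divided by the standard deviation. This is a minimal illustration under the assumption that the individual values are available; the function name and sample values are hypothetical. The only subtlety is that the cube root must keep the sign of the mean cubed deviation.

```python
import math


def cube_root_skewness(values):
    """Cube root of the mean cubed deviation, divided by the standard deviation."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    sigma = math.sqrt(sum(d ** 2 for d in deviations) / n)
    mean_cubed = sum(d ** 3 for d in deviations) / n
    # Take a real cube root that keeps the sign of the mean cubed deviation.
    root = math.copysign(abs(mean_cubed) ** (1.0 / 3.0), mean_cubed)
    return root / sigma


symmetrical = [1, 2, 3, 4, 5]                   # hypothetical values
right_tailed = [1, 2, 2, 2, 3, 3, 4, 5, 9]      # longer range to the right
left_tailed = [10 - v for v in right_tailed]    # longer range to the left

for data in (symmetrical, right_tailed, left_tailed):
    print(round(cube_root_skewness(data), 3))
```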

Exercises. 

31.  From  the  computation  form  of  Exercise  1  compute,  in  accord- 
ance with  the  second  method,  the  skewness  for  the  student  height  distri- 
bution. 

32.  Do  the  same  for  the  distribution  of  incomes. 


CHAPTER  VI. 
THE  NORMAL  PROBABILITY  CURVE. 

The  Equation  of  a  Frequency  Curve.  As  discussed  in 
Chapter  II,  a  smoothed  curve  is  a  graphic  estimate  of  what 
would be the course of the data if it could be freed from accidental
variations.  The smooth curve is therefore the geometric
representation  of  a  law  of  connection  or  variation.  It  shows, 
for  instance,  the  variation  of  temperature  with  the  seasons ;  the 
tendency  for  precipitation  to  depend  on  the  month  of  the  year ; 
the  most  likely  percent  of  students  at  each  height. 

The  presence  of  an  underlying  law  of  connection  in  the 
data  implies  the  presence  of  an  algebraic  law  connecting  the 
x  and  the  y  coordinates.  The  algebraic  statement  of  the  law 
expressing  y  in  terms  of  x  is  called  the  equation  of  the  curve. 

If  the  equation  is  given,  the  ordinate  can  be  computed  for 
any  abscissa  and  hence  the  curve  can  be  located  by  plotting  a 
sufficient  number  of  computed  points. 

In  some  distributions  it  is  possible  to  discover  a  law  of 
connection  directly  from  the  data,  and  then  without  an  extended 
computation  to  translate  this  law  into  the  proper  algebraic  form. 
We  shall  discuss  in  this  chapter  the  equation  of  only 
one  type  of  curve  —  the  normal  curve.  This  form  of  curve 
is  suited  to  the  representation  of  a  large  class  of  distributions. 
And  the  theory  of  the  normal  curve  can  be  made  use  of  in 
the  determination  of  the  probable  deviation  and  in  the  dis- 
cussion of  certain  other  properties  even  for  a  distribution  to 
which  it  does  not  apply  with  sufficient  accuracy  to  be  adopted 
as  the  form  of  the  smoothed  curve. 

Statistical Theory of the Normal Curve.  The height of
a person is the resultant sum of a large number of elements
such as the length of certain bones, the widths of cartilages, the
erectness of posture.  And, in general, any statistical data can
be analyzed into elemental components.  Whenever these elemental
values are relatively small in comparison with the resultant
values and at the same time each element is equally likely to
take any value within a small range, then the resultant data is
said to be normally distributed.



With  an  absence  of  selection,  as  is  assumed,  it  is  reasonable 
to  conclude  that  the  resulting  distribution  will  be  symmetrical. 
And  it  is  also  apparent,  after  some  consideration,  that  the  fre- 
quencies at  the  center  will  be  high  and  those  at  the  ends  of  the 
range  very  small.  It  may  be  noted  that  in  order  to  have  a  nor- 
mal distribution  it  is  not  at  all  necessary  that  it  be  possible  to 
actually  compute  the  values  of  the  elemental  factors;  it  is  only 
their  existence  under  the  above  assumptions  that  is  predicated. 

The Equation of the Normal Curve.  It can be mathematically
demonstrated that the equation of the Normal Curve is

$$y = \frac{N}{\sigma\sqrt{2\pi}} \, e^{-\frac{x^2}{2\sigma^2}}$$

where N is the total frequency of the distribution; σ, the standard
deviation; π, the well known circle constant 3.14159; and e
is a constant which is numerically equal to 2.71828.  In this form
of the equation x is measured from the arithmetic mean as
origin.

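The Graph of the Normal Equation.  Let us write the
normal equation in the form

$$y = \frac{N}{\sigma} \cdot \frac{1}{\sqrt{2\pi}} \, e^{-\frac{1}{2}\frac{x^2}{\sigma^2}}$$

and then in the form

$$y = \frac{N}{\sigma} \cdot Z, \quad \text{where} \quad Z = \frac{1}{\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x}{\sigma}\right)^2}.$$

The tables of Sheppard give the value of Z for each value
of x/σ from 0.00 to 6.00.  Table V serves to illustrate the complete
table.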

Table of Ordinates and Areas of the Normal Curve

x/σ      Z      Areas.        x/σ      Z      Areas.
0.0    0.399    0.000         1.2    0.194    0.385
0.1    0.397    0.040         1.4    0.150    0.419
0.2    0.391    0.079         1.6    0.111    0.445
0.3    0.381    0.118         1.8    0.079    0.464
0.4    0.368    0.155         2.0    0.054    0.477
0.5    0.352    0.191         2.2    0.035    0.486
0.6    0.333    0.226         2.4    0.022    0.492
0.7    0.312    0.258         2.6    0.014    0.495
0.8    0.290    0.288         2.8    0.008    0.497
0.9    0.266    0.316         3.0    0.004    0.499
1.0    0.242    0.341         3.2    0.002    0.499

TABLE V.
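
The entries of such a table can be checked directly from the definition of Z and from the error function. The sketch below is only an illustration; the function names are hypothetical, and the areas are measured from the center (x/σ = 0), as in the table above.

```python
import math


def z_ordinate(t):
    """Ordinate Z of the unit normal curve at t = x / sigma."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def area_from_center(t):
    """Area under the unit normal curve between 0 and t = x / sigma."""
    return 0.5 * math.erf(t / math.sqrt(2.0))


for t in (0.0, 0.5, 1.0, 2.0, 3.0):
    print(f"{t:3.1f}  {z_ordinate(t):.3f}  {area_from_center(t):.3f}")
```

Intermediate values of x/σ, which the printed table reaches by interpolation, can be computed in the same way.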



With a table of values of Z at hand the plotting of a normal
curve is a simple matter of arithmetic.  Each deviation from
the mean is divided by the standard deviation and then the
table is entered with the quotients, x/σ, and the values of Z
obtained by interpolation.  Multiplication of the interpolated values
by the ratio, N/σ, gives the successive values for y.

In the following illustrative plotting of the normal curve
of the distribution of student heights the ordinates are computed
for the boundaries (as the fractional deviations in the first
column of Table VI denote) instead of the midpoints of the
class intervals.  This is done, as will be presently explained, in
order to find the areas under the curve.  In the computation
scheme, the first column is for the deviations; the second for
the deviations from the mean; the third, the deviations from
the mean divided by the standard deviation; the fourth, the
values of Z obtained from the table; and the fifth column shows
the desired values for the ordinates.  The sixth and seventh
columns are explained on page 63.

Table of Z's and Corresponding Areas for Student Height

                                                            Student Ht.
Deviations.      x       x/σ       Z        y      Areas.     Areas.
    0.5        -7.4     -3.20    0.002      1.     0.999       749
    1.5        -6.4     -2.77    0.01       3.2    0.997       748
    2.5        -5.4     -2.34    0.03       9.7    0.990       743
    3.5        -4.4     -1.91    0.06      19.5    0.972       729
    4.5        -3.4     -1.47    0.14      45.5    0.929       697
    5.5        -2.4     -1.04    0.23      74.7    0.851       638
    6.5        -1.4     -0.61    0.33     107.1    0.729       547
    7.5        -0.4     -0.17    0.39     126.6    0.567       425
    7.9         0.0      0.00    0.40     120.0    0.500       375
    8.5        +0.6     +0.26    0.39     126.6    0.602       452
    9.5        +1.6     +0.69    0.31     100.7    0.745       559
   10.5        +2.6     +1.13    0.21      68.2    0.871       653
   11.5        +3.6     +1.56    0.12      39.0    0.941       706
   12.5        +4.6     +1.99    0.06      19.5    0.977       733
   13.5        +5.6     +2.43    0.02       6.5    0.993       745
   14.5        +6.6     +2.86    0.01       3.2    0.998       749

TABLE VI.
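
The computation of the ordinates, y = (N/σ) · Z, can be sketched as follows for the class boundaries of the student height distribution. This is an illustration under stated assumptions: the total frequency of 750 and the mean of 7.9 are taken from the tables, while the standard deviation of 2.31 is inferred from the x and x/σ columns and is not stated in the text. Because Z is computed here without rounding, the ordinates will differ slightly from values obtained by interpolating in a three-figure table.

```python
import math


def normal_ordinate(x, total_frequency, sigma):
    """Ordinate y of the normal curve at deviation x from the mean."""
    z = math.exp(-0.5 * (x / sigma) ** 2) / math.sqrt(2.0 * math.pi)
    return total_frequency / sigma * z


N = 750        # total frequency of the student height distribution
SIGMA = 2.31   # assumed: inferred from the x and x/sigma columns of the table
MEAN = 7.9     # mean, measured on the same scale as the class boundaries

boundaries = [0.5 + k for k in range(15)]   # 0.5, 1.5, ..., 14.5
for b in boundaries:
    x = b - MEAN
    print(f"{b:5.1f} {x:+5.1f} {x / SIGMA:+6.2f} {normal_ordinate(x, N, SIGMA):7.1f}")
```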



The  computed  points  are  now  plotted  an 


FIG.  VIII.     The  Normal  Curve  of  Student  Heights. 

In Figure VIII the characteristic bell shape of the normal
curve is seen.  The ordinates at the center do not change rapidly.
As the deviations increase the ordinates first decrease rapidly
and then more slowly until the curve flattens out and becomes
almost coincident with the horizontal axis.

It is mathematically evident from the equation
of the normal curve that in a distribution with a large standard
deviation the values for y are relatively small near the center and
relatively large for the greater deviations.  That is, a large value
for σ indicates a flat normal curve.


Exercises. 

1.  Plot  a  normal  curve  for  the  distribution  of  v 
the  data  of  page  29. 

2.  Compare  the  curve  obtained  in  Exercise  1  with  the  smooth  curve 
of  Chapter  III.     How  closely  do  they  coincide? 



Areas under the Normal Curve.  For each class the area
under the frequency curve and between the class limits gives
the smoothed class frequency.  In Sheppard's Table* the areas
are given for each value of x/σ from 0.00 to 6.00.  Table V
is sufficiently complete for illustrative purposes.  The tables give
the areas from x/σ = 0 to each designated value of x/σ.  (In
Sheppard's Tables 0.5000000 is added to these areas.)  Hence
to determine the class areas from the tabular values appropriate
subtractions are necessary, except in the case of the central area
which is obtained by adding the two central fractional areas.  It
must be remembered that the actual area is finally obtained by
multiplying the values just computed by the total frequency,
and not by the total frequency divided by the standard deviation.

The  sixth  column  of  Table  VI  contains,  for  the  frequency 
distribution  of  student  heights,  the  areas  from  the  origin;  the 
class  areas,  obtained  by  subtraction  or  addition,  are  entered  in 
the  second  column  of  Table  VII.  For  comparison  the  original 
frequencies  are  placed  in  the  third  column  of  the  same  table. 

The Adjusted Distribution of Student Heights.

                     Orig.     Pos.     Neg.
Class.    Areas.     Freq.     Diff.    Diff.
   1         1          2       ..        1
   2         5         10       ..        5
   3        14         11        3       ..
   4        32         38       ..        6
   5        59         57        2       ..
   6        91         93       ..        2
   7       122        106       16       ..
   8       127        126        1       ..
   9       107        109       ..        2
  10        94         87        7       ..
  11        53         75       ..       22
  12        27         23        4       ..
  13        12          9        3       ..
  14         4          4       ..       ..

           748        750       36       38

TABLE VII.


* "Tables for Statisticians and Biometricians," Table II.
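
The subtraction of cumulative areas described in this section can be sketched in Python. The error function stands in for the printed table of areas, and the standard deviation of 2.31 is an assumption inferred from the student height table rather than a figure given in the text; the original frequencies are those of the student height distribution. The function names are hypothetical.

```python
import math


def cumulative_fraction(t):
    """Fraction of the area of the normal curve lying below t = x / sigma."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def class_areas(boundaries, mean, sigma, total_frequency):
    """Smoothed class frequencies: N times the area between adjacent boundaries."""
    cumulative = [cumulative_fraction((b - mean) / sigma) for b in boundaries]
    return [total_frequency * (hi - lo) for lo, hi in zip(cumulative, cumulative[1:])]


N, MEAN, SIGMA = 750, 7.9, 2.31            # SIGMA is an assumed value
boundaries = [0.5 + k for k in range(15)]  # class boundaries 0.5 ... 14.5
original = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]

areas = class_areas(boundaries, MEAN, SIGMA, N)
# Differences are positive when the adjusted value exceeds the original.
positive = sum(a - f for a, f in zip(areas, original) if a > f)
negative = sum(f - a for a, f in zip(areas, original) if a < f)
print([round(a) for a in areas])
print(round(positive, 1), round(negative, 1))
```

The multiplication is by the total frequency alone, not by the total frequency divided by the standard deviation, since the areas are already fractions of the whole.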


The  goodness  of  fit  of  this  normal  curve  is  indicated  by 
the  differences  of  the  fourth  and  fifth  columns.  The  differences 
are  taken  positive  when  the  adjusted  values  exceed  the  original 
frequencies.  The  sum  of  the  positive  and  of  the  negative  dif- 
ferences shows  a  fairly  close  fit,  though  the  size  of  the  individ- 
ual differences  must  also  be  taken  into  account  in  estimating  the 
closeness  of  fit. 

Exercises. 

3.  Test  the  closeness  of  fit  of  the  normal  curve  of  student  weights 
plotted  in  Exercise  1. 

4.  Compare  the  closeness  of  fit  of  the  normal  curves  of  weight  and 
height. 

Preliminary Determination of Normality.  Before fitting
a curve to a given distribution the data should be analyzed to
determine whether the fundamental conditions of normality are
satisfied: whether each value of the data is the sum of a large
number of relatively small elements, each one of which is as likely
to have one value as another, etc.  The data should be
plotted and the smooth curve drawn by the methods of Chapter
II.  If the one or the other or both of these tests seem to indicate
a normal distribution a normal curve should be fitted.

A more elaborate test for normality is to compute the mean
cubed and the mean fourth power deviation.  Unless the mean
cubed deviation is very small the distribution possesses too much
skewness or asymmetry to be closely fitted by a normal curve.
It can be shown that for a distribution to be normal the mean
fourth power of the deviations divided by the square of the mean
squared deviation must be equal to 3.  The variation that may
be allowed from these two arithmetical tests is shown by Tables
XXXVII and XXXVIII of "Tables for Statisticians and Biometricians."
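
The two arithmetical tests can be sketched as follows: the mean cubed deviation, and the ratio of the mean fourth power deviation to the square of the mean squared deviation. The function name is hypothetical, and the data are an artificial flat (rectangular) set of values chosen to show a distribution that passes the first test but fails the second.

```python
def moment_tests(values):
    """Return the mean cubed deviation and the ratio m4 / m2**2."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # mean squared deviation
    m3 = sum((v - mean) ** 3 for v in values) / n   # mean cubed deviation
    m4 = sum((v - mean) ** 4 for v in values) / n   # mean fourth power deviation
    return m3, m4 / m2 ** 2


# A flat (rectangular) set of values: symmetrical, but far from normal.
flat = list(range(1, 1002))
m3, ratio = moment_tests(flat)
print(m3, round(ratio, 2))
```

For this flat set the mean cubed deviation vanishes, yet the ratio falls well below 3, so a normal curve would not be expected to fit it closely.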

However,  a  practically  conclusive  test  of  the  appropriate- 
ness of  the  normal  curve  is  that  of  comparing  the  adjusted  with 
the  original  frequencies. 

Exercises. 

5.  Discuss  the  advisability  of  attempting  to  fit  a  normal  curve  to 
the  precipitation  data  of  page  7. 

6.  Is   it  likely  that  the   frequency  distribution  of   March  tempera- 
tures will  be  more  nearly  normal  than  the  distribution  of  temperatures 
for  all  months? 



7.  Discuss  the  probable  fit  of  a  normal  curve  in  the  case  of  the 
top  beef  cattle  prices. 

8.  Is it likely that a normal curve will fit the income data of page
57?

9.  What  does  a  divergence  from  normality  indicate? 

10.  What   reasons   are   there   for   thinking  that  the  distribution  of 
grades  in  a  large  class  of  students  should  be  normal? 

11.  What  reasons  might  exist  for  thinking  that  data  of  prices  might 
not  be  normally  distributed? 

Probable Deviation in a Normal Distribution.  The
quartiles divide the two halves of the area into equal parts;
hence, in Table V, the value of x/σ which corresponds to an
area of 0.25 gives the value of the probable deviation.  This value
of x/σ is there found equal to 0.6745.  Therefore the deviation of
the quartile is 0.6745 times the standard deviation.  This demonstrates
the rule for obtaining the probable deviation: multiply
the standard deviation by 0.6745.
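
The factor 0.6745 can be recovered numerically as the value of x/σ that cuts off an area of 0.25 between the center and itself. The sketch below uses `NormalDist` from the Python standard library (Python 3.8 or later) in place of the printed table; the standard deviation of 2.36 inches is used only as a sample value.

```python
from statistics import NormalDist

unit_normal = NormalDist(mu=0.0, sigma=1.0)

# The quartile cuts off an area of 0.25 between the centre and itself,
# that is, 0.75 of the whole area lies below it.
quartile_ratio = unit_normal.inv_cdf(0.75)
print(round(quartile_ratio, 4))

# Probable deviation for an assumed standard deviation of 2.36 inches.
print(round(quartile_ratio * 2.36, 2))
```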

The  formulas  for  the  probable  deviation  of  the  arithmetic 
mean  and  of  the  standard  deviation  referred  to  in  the  preceding 
chapter  are  derived  on  the  assumption  that  the  two  constants  are 
each  normally  distributed. 

It  can  be  shown  mathematically  that  even  when  the  form 
of  distribution  is  distinctly  non-normal  the  ordinary  rules  for 
finding  the  probable  deviations  hold  with  an  approximation  close 
enough  for  practical  purposes,  and  experimentation  with  dif- 
ferent forms  of  distributions  bears  out  the  mathematical  con- 
clusions. 

Exercises. 

12.  What    is    the    deviation    corresponding    to    the    ordinate    which 
marks  off  three-fourths  of  the  area  to  the  right  of  the  center? 

13.  What  part  of  the  area  under  the  normal  curve  is  included  be- 
tween the  median  and  the  ordinate  with  a  deviation  of  two  times  the 
standard  deviation?    Three  times  the  standard  deviation?    Four  times  the 
standard  deviation? 

The results of Exercise 13 show that the occurrence of a
deviation of three times the standard deviation is highly improbable.
That is, a deviation greater than about three times the
standard deviation must significantly indicate that the measurement
is not that of an individual taken from the same material;
it does not belong to the same distribution but to another distribution
which has some conditions different from the first.  To
illustrate, the standard deviation of student heights is 2.36
inches and the mean height is 67.9 inches.  One would according
to this theory be justified in concluding that a person with a
height of 75 inches (67.9 + 3 × 2.36 = 74.98) does not belong
to the student group.
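
The areas behind this rule can be computed directly. The following sketch, again using `NormalDist` from the Python standard library as a stand-in for the printed tables, gives the area between the median and two, three and four times the standard deviation, and the fraction of a normal distribution with mean 67.9 and standard deviation 2.36 inches that would exceed 75 inches. It assumes the heights follow the normal curve exactly, which the text treats only as an approximation.

```python
from statistics import NormalDist

unit_normal = NormalDist()  # mean 0, standard deviation 1

for k in (2, 3, 4):
    centre_to_k = unit_normal.cdf(k) - 0.5          # area from the median to k sigma
    beyond_k = 2.0 * (1.0 - unit_normal.cdf(k))     # area outside +/- k sigma
    print(k, round(centre_to_k, 5), round(beyond_k, 5))

# The height example from the text: mean 67.9 inches, sigma 2.36 inches.
heights = NormalDist(mu=67.9, sigma=2.36)
fraction_taller = 1.0 - heights.cdf(75.0)
print(round(fraction_taller, 4))
```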

13.  Does  the  theory  of  Exercise  13  accord  with  the  actual  distribu- 
tion of  the  student  heights? 

14.  Does  the  theory  of  Exercise  13  accord  with  the  actual  distribu- 
tion of  the  student  weights?     With  the  distribution  of  monthly  precipita- 
tion  for  the  Columbus   Station?     With   the   distribution   of   the   monthly 
temperatures  for  the  Columbus  Station? 

15.  On  the  basis  of  the  results  of  Exercises  13  and  14  and  on  other 
investigations  of  a  similar  nature  discuss  the  practical  applicability  of  the 
present  theory  of  the  probable  deviation. 

While  it  is  not  advisable  to  place  implicit  confidence  in  the 
tests  furnished  by  the  theory  of  probable  deviations  to  the  ex- 
tent that  the  results  which  it  indicates  are  accepted  without  some 
independent  verification,  or  at  least  justification,  yet  when  used 
with  judgment  it  is  an  extremely  valuable  aid  in  practical  statis- 
tical work.  In  every  case  it  establishes  cautionary  limits,  as,  for 
instance,  one  would  not  ordinarily  be  justified  in  concluding 
that  a  variate  with  a  deviation  much  greater  than  two  or  three 
times  the  standard  deviation  belonged  to  the  distribution.  On 
the  other  hand  if  a  number  of  measurements  of  height  should 
each  consistently  exceed  those  of  the  student  distribution  it 
might  then  be  concluded  with  much  certainty  that  the  in- 
dividuals measured  were  taken  from  a  population  distinctly 
different  from  the  student  population.  And  the  conclusion 
would  be  justified  even  tho  the  deviations  were  considerably 
less  than  two  or  three  times  the  standard  deviation. 


CHAPTER  VII. 
THE  CORRELATION  TABLE. 

From  the  records  of  the  physical  measurements  a  tabulation 
was  made  of  the  heights  of  the  students  whose  weight  was  from 
130  to  134  pounds — a  weight  class  which  may  be  denoted  by  the 
middle  weight,  132  pounds  —  and  the  following  distribution 
obtained : 

Height       62    63    64    65    66    67     68    69    70    71     72    73    74
Number      2      5      4     19     18     18     17      8      8      4      3      1      1 

The  distributions  were  likewise  obtained  for  each  other  five- 
pound  interval  from  102  to  187  pounds.  Instead  of  writing  each 
of  these  distributions  separately  it  is  more  convenient  to  write 
them  together  in  one  table  called,  for  reasons  explained  on  page 
73,  a  correlation  table.  In  this  way  we  have  Table  VIII. 

Correlation  Table  of  Height  and  Weight. 

HEIGHT IN INCHES.


61     62    63 

64    65 

66    67 

68    69 

70 

71 

72 

73 

74  To'ls 

102   i  ..      ..       1 

2      *   3 

1 

1 

8 

107      ..       31 

5       2 

1     1 

..      13 

,  ^^$'2 

O 

7      3 

3      3 

2 

..      20 

1  "T117 

2      2 

If)      9 

6      6 

7 

2 

2 

2 

..      48 

122      ..       14 

2     12, 

17     If! 

14 

4 

5 

1 

..       76 

127        1       1       1 

7      7 

11     15 

16 

18 

9 

5 

2 

..      93 

1-7 

H2       ..       2     .. 

4      9 

T8     18 

17 

8 

8 

4 

3 

1 

1      93 

§1 

137 

1     .. 

3       4 

14    20 

2-1 

21 

11 

9 

2 

1    110 

£ 

142 



..       7 

12     10 

17 

17 

8 

15 

5 

2 

..      95 

y. 

— 

1  17 

a 

3      7 

5 

12 

9 

8 

3 

49 

I 

152 

..       2 

2      3 

14 

10 

12 

11 

i 

1      56 

157 

4      1 

6 

7 

5 

7 

1 

..      31 

£ 

162 

1      ..      .. 

o 

2 

3 

8 

2 

2 

2 

..      22 

167 

1 

2 

6 

1 

2 

1 

..      13 

172 

1 

1 

1 

6 

2 

.  . 

..       11 

177 

1 

1 

1        3 

182 

..       1 

1 

2 

187 

1 

3 

2 

2 

1 

•• 

9 

Totals    2     10     11 

:>,x     58 

93  106 

126 

n 

109 

87 

75 

23 

9 

4    750 

TABLE  VIII 


The  writing  of  the  distributions  in  this  compact  tabular 
form  greatly  facilitates  the  study  and  comparison  of  the  different 
distributions. 

Exercises. 

1.  Notice  that  there  is  a  decided  increase  in  weight  with  an  increase 
in  height ;  that  there  are  no  extremely  tall  persons  in  the  group  who  are  at 
the  same  time  extremely  light  in  weight;  that  there  are  practically  no 
persons  who  are  both  short  and  extremely  heavy. 

2.  Note  that  there  is  a  closer  connection  between  height  and  weight 
for  the   shorter   and  lighter   individuals   than    for   persons   with    medium 
values  of  the  two  characteristics. 


The  Construction  of  a  Correlation  Table.  Let  us  con- 
struct the  correlation  table  of  monthly  precipitation  and  monthly 
mean  temperatures  for  the  Columbus  Station.  The  data  is 
given under Exercises 2 and 3 of Chapter III.  Let the horizontal
scale  refer  to  temperatures  and  let  each  class  of  this  scale  have 
a  width  of  five  degrees.  The  vertical  scale  will  then  refer  to 
precipitation  and  let  the  width  of  classes  be  taken  as  one-half 
inch.  The  scales  are  written  across  the  top  and  down  the  left 
hand  margin  respectively  in  order  to  leave  room  for  the  sum- 
mations across  the  bottom  and  down  the  right  hand  margin. 
Under  this  arrangement  of  the  scales  y  increases  in  value  from 
top  to  bottom  and  hence  the  positive  direction  for  y  is  downward. 

In  constructing  the  table  it  is  convenient  to  rewrite  the  data 
according  to  classes  and  at  the  same  time  to  combine  the  two 
distributions.  There  is  no  need  for  retaining  the  dates  but  care 
must  be  taken  that  the  measures  from  exactly  the  same  months 
are  written  together.  This  is  done  by  starting  with  January, 
1879,  and  proceeding  with  the  Januarys  and  then  February, 
1879,  and  so  on  in  order.  The  temperature  figures  are  written 
first in each pair of numbers, and the lower limit is written as
the class number of each class.  Thus the pair 25 over 1.5 refers to
a month with a mean temperature from 25 to 29 degrees inclusive
and with a precipitation from 1.5 to 1.9 inches inclusive.  In this
way there will be built up a table of the following form:

25         40         20        30        25         20         20         20         25         25 

1.5     4.0    2.0    4.5     3.0    2.0    3.5     4.0    2.0    3.5,  etc. 




Next the rulings must be made for the table.  The tabulation
proceeds in the following manner: for the first pair of
numbers find the 25 column and drop down this column to the
precipitation class 1.5 and mark a score; then to the 40 column
and down to the 4.0 class and tally; then to column 20 and
precipitation class 2.0; etc.
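
The tallying procedure can be sketched in Python. The ten pairs are the ones listed above; the function names are hypothetical, and the helper `class_lower_limit` shows how a raw measurement would be reduced to the lower limit of its class before tallying.

```python
import math
from collections import Counter


def class_lower_limit(value, width):
    """Lower limit of the class of the given width that contains value."""
    return math.floor(value / width) * width


def correlation_table(pairs, x_width, y_width):
    """Tally (x, y) pairs into subclasses keyed by their lower class limits."""
    table = Counter()
    for x, y in pairs:
        table[(class_lower_limit(x, x_width), class_lower_limit(y, y_width))] += 1
    return table


# The first ten pairs listed in the text (temperature class, precipitation class).
temperatures = [25, 40, 20, 30, 25, 20, 20, 20, 25, 25]
precipitation = [1.5, 4.0, 2.0, 4.5, 3.0, 2.0, 3.5, 4.0, 2.0, 3.5]

table = correlation_table(zip(temperatures, precipitation), x_width=5, y_width=0.5)
column_totals = Counter()
for (temp_class, _), count in table.items():
    column_totals[temp_class] += count
print(table[(20, 2.0)], column_totals[20])
```

Each entry of `table` corresponds to one cell of the ruled form, and the column totals are the summations written across the bottom margin.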

The  diagram  of  tallies,  usually  dots,  is  called  the  Scatter 
Diagram. 

Correlation  Table  of  Temperature  and  Precipitation. 

PRECIPITATION  IN   INCHES. 


15    20 

25    30    35    40    45    50    55    60    65    70    75To'ls 

0.5 

..      3 

..       2      2      1      1      4      1      1      1      3      1      20 

1.0 

..       2 

4522..       333443      35 

1.5 

1      2c 

478546645    10      4      66 

2.0 
2.5 

2-    I-- 
..      l 

5      5      6      6     (6\  5      3      3      6      7      7      65 
§      5      S      3      3^  f)    ©C§\  SrlS      2      61 

l/j 

3.0 

1 

5      7      2      <    ..   \$     ^M    N$iX-S      &     43 

OS 

3.5 
4.0 

1 

..      1 

522,     6V  2      a/  1      4      3      4      6      38 
2      2  ^   &    ..      2      4      2      4      2      >/  1      30 

u 

4.5 

..      1 

..       6331212..       32      24 

5.0 

2..       441211561      27 

w 

5.5 

..     ..       3      1     ..       1     ..       2      1      3     ..       11 

H 

6.0 

..       1     ..      2     ..     ..       1      1     ..       3      1        9 

6  5 

2     ...       1               3 

u 

7.0 

1      1     ..       1      1      12        7 

H 

H 

7.5 

.  . 

..       1     ..       1     1     3 

> 

8  0 

1       1 

8  5 

11                      2 

9  0 

..     ..     0 

9  5 

1     ..        1 

Totals 

3    16 

31    43    44    42    21    44    30    35    39    73    35    456 

TABLE  IX. 

Table IX, the correlation table, is made from the Scatter
Diagram by inserting the frequencies in the place of the tallies.

Exercises. 

3.  Do  wet  months  uniformly  occur  with  warm  months?  or  is  there 
more  of  a  tendency  for  wet  and  cold  or  cool  months  to  be  associated? 

4.  What  may  be  said  as  to  the  tendency  for  dry  and  warm  months 
to be associated?  for dry and cool months?



5.  Does  there  seem  to  be  as  close  a  connection  between  precipitation 
and  temperature  as  between  height  and  weight? 

6.  Is  it  not  possible  that  the  real  connection  between  precipitation 
and  temperature  in  this  table  is   obscured  by  the   fact  that  data   for  all 
four  seasons  is  thrown  together?     Explain. 

Definitions and Symbols.  The properties, as height and
weight or temperatures and precipitation, are called the attributes
or characteristics.

The horizontal deviations are called the x classes or
deviations, and the vertical, the y classes or deviations.  Each
subclass or subgroup thus has a value of x and of y associated
with it.  It is convenient to number the x and y classes from
left to right and from top to bottom, respectively, and use these
numbers for class numbers instead of the actual class values.
Thus there are 17 persons with height 66 inches and weight
122 pounds; and 4 months with a mean temperature of from 40
to 45 degrees and a precipitation of from 3.0 to 3.5 inches.  In
terms of x and y, the subclass x = 6, y = 5 has a frequency of 17;
the subclass x = 5, y = 6 has a frequency of 4 months.

The columns and rows are spoken of as arrays; the columns
as y-arrays of type x and the rows as x-arrays of type y.
Or the concrete names of the data may be given to the arrays:
the weight array of height 67 inches; the height array of
weight 132 pounds; the precipitation array of temperature 40
degrees, etc. It should be noted that the weight array of height
type 67 inches is the distribution with respect to weight of the
persons having a height of 67 inches; the precipitation array of
type 40 degrees is the precipitation distribution of the months
having a mean temperature of 40 degrees.

A y array of type x and an x array of type y are said to
be arrays of opposite sense. Two y arrays or two x arrays
are arrays of the same sense.

The frequency of a y array is denoted by the symbol $n_x$,
where x is the type of the array. The frequency of an x array
is denoted by the symbol $n_y$, where y is the type. The frequency
of a subclass is denoted by the symbol $n_{xy}$, where x and y are
the deviations of the subclass; that is, the types of its two arrays.
Thus, $n_{61} = 2$; $n_{132} = 93$; $n_{66,\,142} = 12$, or if the simpler class
numbers are used, $n_{1:} = 2$; $n_{:7} = 93$; $n_{6:9} = 12$. When the latter
form of class numbers is employed it is necessary to distinguish
between x and y class numbers by means of a colon.
Sometimes the distinction between x and y deviations or class
numbers is made by the use of subscripts as $n_{x_1 y_2}$.
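
As an illustration of these symbols, the frequencies of the arrays are simply the column and row totals of the correlation table. The small table and the function name below are hypothetical and serve only to show the bookkeeping.

```python
# Hypothetical correlation table: rows are y classes, columns are x classes.
table = [
    [2, 1, 0],
    [1, 4, 2],
    [0, 3, 5],
]

def array_frequencies(table):
    """Return (n_x, n_y, N): column totals, row totals and the total frequency."""
    n_x = [sum(column) for column in zip(*table)]  # frequency of each y array of type x
    n_y = [sum(row) for row in table]              # frequency of each x array of type y
    return n_x, n_y, sum(n_y)

n_x, n_y, N = array_frequencies(table)
print("n_x =", n_x)
print("n_y =", n_y)
print("N   =", N)
```

The subclass frequency for x class i and y class j is the single entry `table[j - 1][i - 1]`, and both sets of array frequencies add up to the same total.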

Exercises. 

7.  Write the values of $n_x$ for x = 2, 4, 9 in the precipitation data.

8.  Write  the  values  of  n-i-A  ii»:»  for  both  the  height-weight  and  the 
precipitation-temperature  data. 

9.  Practice stating the frequencies of the various arrays and
subgroups; e.g. the frequency of the weight array of type 8 (68) is 126.

10.  Note that $n_{1:7} + n_{2:7} + n_{3:7} + \dots + n_{14:7} = n_{:7} = 93$, for the
height-weight data.

11.  Write  other  statements  in  the  form  of  that  of  Exercise  10. 

The mean of the vertical column of totals is called the
mean of all the weights, and in general, the mean of all the y's;
and is denoted by the symbol $\bar{y}$. It is the mean of the vertical
deviations of the variates when unclassified with respect to the
horizontal attribute; the mean weight for all heights; the mean
monthly precipitation disregarding temperature; the mean
monthly precipitation for all temperatures taken together.

Likewise, the mean of all the x's is denoted by the symbol
$\bar{x}$.

The means of the weight arrays are denoted by the symbols
$\bar{y}_{61}$, $\bar{y}_{62}$, $\bar{y}_{63}$. In general the mean of the y-array of type
x is denoted by the symbol $\bar{y}_x$. The mean of the x-array of
type y is denoted by the symbol $\bar{x}_y$.
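
A short sketch of how the means of the y-arrays and the mean of all the y's might be computed from a table, using the class numbers 1, 2, 3, ... in place of the actual values. The table is hypothetical, and the function assumes that no column is empty.

```python
def array_means(table):
    """Means of the y-arrays (one per column) and the mean of all the y's.

    Class numbers 1, 2, 3, ... from top to bottom stand in for the y values.
    Every column is assumed to hold at least one observation.
    """
    y_classes = range(1, len(table) + 1)
    column_means = []
    for column in zip(*table):
        n_x = sum(column)
        column_means.append(sum(y * f for y, f in zip(y_classes, column)) / n_x)
    n_y = [sum(row) for row in table]
    overall_mean = sum(y * f for y, f in zip(y_classes, n_y)) / sum(n_y)
    return column_means, overall_mean

# Hypothetical table: rows are y classes, columns are x classes.
table = [
    [2, 1, 0],
    [1, 4, 2],
    [0, 3, 5],
]
means, y_bar = array_means(table)
print([round(m, 3) for m in means], round(y_bar, 3))
```

The mean of all the y's is the average of the array means when each is weighted by the frequency of its array, which gives a convenient check on the arithmetic.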

Exercises. 

12.  From  the  following  data  construct  the  correlation  table  of  top 
hog  and  top  beef  cattle  prices  at  Chicago. 


Chicago Monthly Top Hog Prices.
(In dollars.)

Year   Jan.   Feb.   Mar.   Apr.    May   June   July   Aug.  Sept.   Oct.   Nov.   Dec.
1916   8.10   8.90  10.10  10.05  10.35  10.15  10.25  11.55  11.60  10.35  10.35  10.80
1915   7.40   7.25   7.05   7.90   7.95   7.95   8.12   8.05   8.50   8.95   7.75   7.10
1914   8.00   8.90   9.00   8.95   8.67   8.50   9.30  10.20   9.75   9.00   8.25   7.75
1913   7.80   8.70   9.62   9.70   8.85   9.00   9.62   9.40   9.65   9.10   8.30   8.15
1912   6.70   6.57   7.95   8.20   8.05   7.30   8.50   9.00   9.27   9.42   8.30   7.85
1911   8.30   7.90   7.35   6.90   6.50   6.72   7.55   7.95   7.80   6.90   6.72   6.60
1910   9.05  10.00  11.20  11.00   9.35   9.80   9.60   9.70  10.10   9.65   8.70   8.10
1909   6.70   6.95   7.15   7.60   7.55   8.20   8.45   8.32   8.60   8.40   8.45   8.75
1908   4.72   4.70   6.35   6.45   5.90   6.67   7.10   7.10   7.60   7.20   6.40   6.15
1907   7.05   7.25   7.10   6.90   6.65   6.42   6.65   6.72   7.00   7.00   6.32   5.30
1906   5.72   6.42   6.55   6.82   6.67   6.85   7.00   6.75   6.82   6.85   6.50   6.55
1905   5.00   5.12   5.55   5.72   5.65   5.70   6.17   6.45   6.20   5.80   5.25   5.35
1904   5.20   5.30   5.82   5.50   4.95   5.40   5.90   5.80   6.37   6.30   5.25   4.87
1903   7.10   7.65   7.87   7.65   7.15   6.45   6.10   6.20   6.45   6.50   5.50   4.90
1902   6.85   6.60   6.95   7.50   7.50   7.95   8.25   7.95   8.20   7.92   6.95   6.80
1901   5.47   5.65   6.20   6.25   6.05   6.30   6.40   6.75   7.37   7.10   6.30   6.90
1900   4.92   5.10   5.55   5.85   5.57   5.42   5.55   5.57   5.70   5.55   5.12   5.10
1899   4.05   4.05   4.00   4.15   4.05   4.00   4.70   5.00   4.90   4.90   4.35   4.45
1898   4.00   4.27   4.17   4.15   4.80   4.50   4.17   4.20   4.15   4.00   3.85   3.75
1897   3.60   3.75   4.25   4.25   4.05   3.65   4.00   4.55   4.65   4.40   3.80   3.60
1896   4.45   4.35   4.35   4.15   3.75   3.60   3.70   3.75   3.50   3.65   3.67   3.65
1895   4.80   4.65   5.30   5.42   4.97   5.10   5.70   5.40   4.65   4.50   3.85   3.75

Chicago Monthly Top Beef Cattle Prices.
(In dollars.)

Year   Jan.   Feb.   Mar.   Apr.    May   June   July   Aug.  Sept.   Oct.   Nov.   Dec.
1916   9.85   9.75  10.05  10.00  10.90  11.50  11.30  11.50  11.50  11.65  12.40  13.00
1915   9.70   9.50   9.15   8.90   9.65   9.95  10.40  10.50  10.50  10.60  10.55  11.60
1914   9.50   9.75   9.75   9.55   9.60   9.45  10.00  10.90  11.05  11.00  11.00  11.40
1913   9.50   9.25   9.30   9.25   9.10   9.20   9.20   9.25   9.50   9.75   9.85  10.25
1912   8.75   9.00   8.85   9.00   9.40   9.60   9.85  10.65  11.00  11.05  11.00  11.25
1911   7.10   7.05   7.35   7.10   6.50   6.75   7.35   8.20   8.25   9.00   9.25   9.35
1910   8.40   8.10   8.85   8.65   8.75   8.85   8.60   8.50   8.50   8.00   7.75   7.55
1909   7.50   7.15   7.40   7.15   7.30   7.50   7.65   8.00   8.50   9.10   9.25   9.50
1908   6.40   6.25   7.50   7.40   7.40   8.40   8.25   7.90   7.85   7.65   8.00   8.00
1907   7.30   7.25   6.90   6.75   6.50   7.10   7.50   7.60   7.35   7.45   7.25   6.35
1906   6.50   6.40   6.35   6.35   6.20   6.10   6.50   6.85   6.95   7.30   7.40   7.90
1905   6.35   6.45   6.35   7.00   6.85   6.35   6.25   6.50   6.50   6.40   6.75   7.00
1904   5.90   6.00   5.80   5.80   5.90   6.70   6.65   6.40   6.55   7.00   7.30   7.65
1903   6.85   6.15   5.75   5.80   5.65   5.15   5.65   6.10   6.15   6.00   5.85   6.00
1902   7.75   7.35   7.40   7.50   7.70   8.50   8.85   9.00   8.85   8.75   7.40   7.75
1901   6.15   6.00   6.25   6.00   6.10   6.55   6.40   6.40   6.60   6.90   7.25   8.00
1900   6.60   6.10   6.05   6.00   5.85   5.90   5.85   6.20   6.15   6.00   6.00   7.50
1899   6.30   6.25   5.90   5.85   5.75   5.75   6.00   6.65   6.90   7.00   7.15   8.25
1898   5.50   5.85   5.80   5.50   5.50   5.35   5.65   5.75   5.85   5.90   6.25   6.25
1897   5.50   5.40   5.65   5.50   5.45   5.30   5.25   5.50   6.00   5.40   6.00   5.65
1896   4.00   4.75   4.75   4.75   4.55   4.65   4.60   5.00   5.30   5.30   5.45   6.50
1895   5.80   5.80   6.60   6.40   5.25   6.00   6.00   6.00   6.00   5.60   5.00   5.50


13.  From  the  data  of  Exercise  12,  construct  the  correlation  table  of 
hog  prices  and  months  of  the  year. 

14.  From data obtained from a financial journal construct a
correlation table of the prices of common and preferred stocks.


15.  In  the  correlation  table  of  Exercise  12  does  there  appear  to  be 
a  sharp  tendency  for  the  beef  cattle  arrays  to  vary  with  the  changing  live 
hog  prices?    Is  the  tendency  more  pronounced  at  some  parts  of  the  table 
than  at  others? 

16.  Compare the tendencies for close connection between the
attributes in the table of Exercise 13 with those in Exercise 12.

Correlation. In the table of student heights and weights
there is a decided tendency for heaviness and tallness to be
associated and for lightness and shortness to be associated.
There is likewise a pronounced tendency for the prices of
live hogs and beef cattle to vary together. It is to be noted
that the two series of measurements do not vary together in
every case; that is, there are months in which the price of
hogs is low but the price of beef high. But when all the
months of an array are taken together the general tendency
for the progressive increase of beef cattle prices with each
increase of hog prices is evident. Two characteristics are said
to be correlated when there is a tendency for the changes in
the value of one to depend on the changes in the value of the
other. The two characteristics may increase together, or one may
increase while the other decreases; and even in one part of the
table the changes may move together while in another part the
two series of changes move in opposition. The essential
evidence for the presence of correlation is that the
measurements change from array to array.

In uncorrelated data there is no tendency for the
distributions of the arrays to change from type to type.

In perfectly correlated data there is an exact connection
between the values of the two characteristics. If height and
weight were perfectly correlated, for instance, all persons of a
given height, say 68 inches, would be of the same weight and
hence all the frequencies of the weight array of type 68 would
lie within a single subgroup. Between the two extremes of
perfect and of no correlation there are all degrees of correlation.

Exercises. 

17.  Study the degrees of correlation shown by the tables
constructed in working the exercises of this chapter.

18.  Is it possible to find actual data which shows absolutely no
correlation?  Construct an imaginary table which shows no correlation.


CHAPTER  VIII. 
THE  CORRELATION  RATIO. 

The Mean as Representative of the Array. In Chapter
IV it was stated that the modal deviation is the most frequent
deviation; that is, the most typical deviation of a distribution.
Because the mode cannot be computed by a simple and uniform
process of arithmetic the mean is a more practicable
representative of the array. And this substitution of the mean for the
mode will rarely produce a serious error.

Since the mean of the frequencies of an array is taken as
the representative of the deviations of the array, from the
definition of correlation on page 73 it is apparent that the amount
or degree of correlation in the data will be indicated by the
variation in the means from array to array.

Regression Curves. The variation in the means of the
arrays is shown graphically by the curve of means, which is
called a regression* curve.

Since  there  are  two  sets  of  arrays  there  are  two  regression 
curves. 

Coordinate Axes. It is usual to take for the horizontal
or x-axis the horizontal line thru the mean of all the y's; that is,
the horizontal line at a distance $\bar{y}$ below the base line of the
table, and for the y-axis the vertical line distant $\bar{x}$ from the left
marginal vertical. The point of intersection of these two lines
is called the center of the table. Deviations to the right are
taken positive and those to the left negative; deviations
downward from the new horizontal axis are positive and deviations
upward are considered negative. Sometimes this convention of
plus downward and negative upward is departed from. No
confusion can result however if it is remembered that the direction
in which an attribute is increasing is always taken as positive.


*  So  called  by  Francis  Galton  for  certain  reasons  which  arose  in  his 
investigations  in  biology.     The  name  has  become  general. 

Exercises. 

1.  Draw  the  axes  and  regression  curves  for  each  of  the  correlation 
tables  of  Chapter  VII. 

2.  Study and compare the forms of the regression curves of
Exercise 1.

Correlation and the Regression Curves. In uncorrelated
data the mean of an array does not depend on the type of the
array; that is, it does not change from array to array, and hence
the unchanging value of the means must be the same as the mean of
all the y's.

The regression curve for uncorrelated data therefore
approximates a straight line coinciding with the horizontal axis.
For correlated data the regression curve diverges or deviates
from this position of coincidence with the axis. It must be noted
that the shape of the regression curve may be quite irregular
without effect on the degree of correlation present in the data;
it is the distance of the means from the axis that counts in
determining the degree of correlation present. Hence any
numerical measure of the extent of correlation in the data must
depend on the deviation of the means from the horizontal axis
thru the center.

Since there are two regression curves and two axes there
are two correlations in each correlation table and their numerical
measures involve the deviations of the respective regression
curves from the corresponding straight lines thru the center.
Thus the dependence of height on weight and of weight on
height are two distinct correlations.

Mean Squared Deviation of the Means of Arrays. The
mean squared deviation is the most convenient measure of the
deviation of the means of the arrays. In computing this the
means of the arrays are first written in a vertical column and
then the difference between each mean and the mean of all the
variates is set down in a second column. Because the differences
are used only in the squared form it is not necessary to retain a
negative sign.

The third column in the computations of Table X, page
77, contains the squares of the differences. Since the mean
of an array is used as the representative of the individuals
of that array, each of these individuals is possessed of
the squared deviation. Hence each square must be multiplied
by the frequency of the corresponding array. The
resultant products form the fourth column. The sum of this
fourth column is the total sum of squared deviations and this
sum divided by the total frequency is the mean squared
deviation.

The Correlation Ratio. The mean squared deviation just
obtained would be a significant measure of correlation were it
not for the fact that it does not take into account the
dispersion of the data as a whole. Without changing the mean and
the frequency of a single y-array, it would be possible to
spread out each array to twice its length. This alteration
would concern the dispersion of the data as a whole but
would leave the mean square deviation from the horizontal
axis unchanged. It is evident that the value of the mean
square deviation of the means of the arrays is of less
significance in the more spread out data. Hence the
dispersion of the data as a whole must be considered in
interpreting the value of the mean squared deviation. The dispersion
of the data as a whole is given by the standard deviation of
the frequencies of the totals in the vertical sum column. The
smaller this standard deviation the more significant is the
deviation of the means, and the larger this standard deviation
the less significant the deviation of the means. It is therefore
reasonable to divide the square root of the mean square deviation
of the means of the arrays by the standard deviation from the
marginal column. The quotient is called the correlation ratio, and
is denoted by the Greek letter $\eta$. In symbols,
$\eta = \sqrt{\sum n_x (\bar{y}_x - \bar{y})^2 / N} \,/\, \sigma_y$.

The  computation  of  the  correlation  ratio  for  the  dependence 
of  student  weight  on  height  follows. 

A  carefully  planned  outline  scheme  of  computation  must 
be  made  before  the  figures  are  entered. 

The means and the one standard deviation were computed
in the usual manner. We have, for the data as a whole,
$\bar{y} = 7.9$, $\sigma_y^2 = 9.79$. The means of the arrays are written in
the second column just after the frequencies. The differences
between the means and $\bar{y}$ follow in the third column. The
squares, and the product of the squares by the frequencies are
the fourth and fifth columns respectively. The symbols
explained in Chapter VII are written at the head of each column.


Computation of $\eta$.

 $n_x$    $\bar{y}_x$    $\bar{y} - \bar{y}_x$    $(\bar{y} - \bar{y}_x)^2$    $n_x(\bar{y} - \bar{y}_x)^2$

    2      9.5      1.6       2.56        5.12
   10      4.7      3.2      10.24      102.40
   11      4.4      3.5      12.25      134.75
   38      4.6      3.1       9.61      365.18
   57      6.1      1.8       3.24      184.68
   93      6.8      1.1       1.21      112.53
  106      6.9      1.0       1.00      106.00
  126      8.0      0.1       0.01        1.26
  109      8.8      0.9       0.81       89.19
   87      9.7      1.8       3.24      281.88
   75     10.9      3.0       9.00      675.00
   23     10.3      2.4       5.76      132.48
    9     11.1      3.2      10.24       92.16
    4     10.5      2.6       6.76       27.04

  750                                  2309.67

$\eta^2 = \dfrac{2309.67}{750 \times 9.79} = 0.3146$,   $\eta = 0.56$.

TABLE X.
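
The arithmetic of Table X may be checked with a few lines of Python. This is only a sketch: the lists are read from the frequency and difference columns of the table, and the fifth frequency is taken as 57, the value implied by the printed product 184.68 and by the total of 750. Because every product is recomputed, the sum may differ from the printed 2309.67 by a fraction of a unit, which does not affect the two-figure value of the ratio.

```python
from math import sqrt

# Array frequencies and absolute differences between each array mean and
# the general mean, as read from Table X (fifth frequency assumed to be 57).
n_x = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]
diff = [1.6, 3.2, 3.5, 3.1, 1.8, 1.1, 1.0, 0.1, 0.9, 1.8, 3.0, 2.4, 3.2, 2.6]
variance_y = 9.79  # squared standard deviation from the marginal column

total = sum(n_x)
sum_sq = sum(n * d * d for n, d in zip(n_x, diff))  # sum of n_x (difference)^2
eta_squared = sum_sq / (total * variance_y)
eta = sqrt(eta_squared)

print("total frequency:", total)
print("sum of squared deviations:", round(sum_sq, 2))
print("eta:", round(eta, 2))
```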

Exercises. 

3.  Compute the value of $\eta$ for the dependence of monthly
precipitation upon monthly mean temperature as shown by the data of the
Columbus Weather Station.

4.  Compute the value of $\eta$ for the correlation of Chicago top hog
prices with Chicago top beef cattle prices as shown in the table of
Exercise 12 of the preceding chapter.

5.  Compute values of $\eta$ from the tables of Exercises 13 and 14 of
Chapter VII.

Two Values for $\eta$ in Each Table. From the method of
computation it is clear that there are two values for $\eta$ in each
correlation table, one for each regression curve. The
correlation ratio of weight with height, for instance, may differ
considerably from the correlation ratio of height with weight;
the dependence of precipitation on temperature may be of a
decidedly different degree from that of temperature on
precipitation. The two values of $\eta$ do not ordinarily differ markedly
but there can be no a priori assurance that they will be essentially
of equal value and hence it is necessary to compute the two values
separately in case both are desired. To distinguish the two
measures, for the dependence of y on x, of weight on height, the
symbol $\eta_y$ is used and the symbol $\eta_x$ refers to the dependence
of x on y.
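
That the two ratios need not agree can be seen from a small sketch in Python. The table below is hypothetical, the function names are illustrative, and class numbers are used in place of the measured values; interchanging rows and columns gives the second ratio.

```python
from math import sqrt

def correlation_ratio(table):
    """Correlation ratio of the row attribute on the column attribute.

    table[i][j] is the frequency of row class i + 1 and column class j + 1.
    """
    values = range(1, len(table) + 1)
    row_totals = [sum(row) for row in table]
    total = sum(row_totals)
    mean_all = sum(v * f for v, f in zip(values, row_totals)) / total
    variance_all = sum(
        f * (v - mean_all) ** 2 for v, f in zip(values, row_totals)
    ) / total

    between = 0.0
    for column in zip(*table):
        n = sum(column)
        if n == 0:
            continue  # an empty array contributes nothing to the sum
        mean_array = sum(v * f for v, f in zip(values, column)) / n
        between += n * (mean_array - mean_all) ** 2
    return sqrt(between / total / variance_all)

def transpose(table):
    return [list(column) for column in zip(*table)]

# Hypothetical table: rows are y classes, columns are x classes.
table = [
    [6, 1, 0, 3],
    [2, 5, 1, 2],
    [0, 2, 7, 1],
]
eta_y = correlation_ratio(table)             # dependence of y on x
eta_x = correlation_ratio(transpose(table))  # dependence of x on y
print(round(eta_y, 3), round(eta_x, 3))
```

In this imaginary table the means of the columns wander widely while the means of the rows change little, so the first ratio is much the larger of the two.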

Exercises. 

6.  Compute the value of $\eta$ for the correlation of height with
weight and compare with the other value of $\eta$ computed on page 77.

7.  Compute the value of $\eta_x$ from the precipitation-temperature
correlation table, and compare the values of $\eta_x$ and $\eta_y$.

8.  Compute the value of $\eta_x$ for the live stock price table of
Exercise 12, Chapter VII, and compare with the value of $\eta_y$ from the same
table.

Limiting Values of the Correlation Ratio. In theory, the
means lie exactly on the axis for data of zero correlation.
Each separate item, therefore, in the mean square deviation
of the means is zero and hence $\eta$ is zero for zero correlation.

Because each term of the mean squared deviation of the
means is squared and hence necessarily positive any accidental
fluctuations of the means of the arrays in data of essentially
zero correlation increase the value of $\eta$. Since there are no
compensating fluctuations, the result is that small values of $\eta$ are
quite likely to be too large and hence the statistical significance
of $\eta$ for data of a small degree of correlation is open to
question. The degree of correlation in such cases cannot be greater
than the value of $\eta$ would indicate and it may be less. It must
be evident from the nature of the error that for material
showing a considerable degree of correlation the error from this
source is negligible.

As discussed in Chapter VII, in perfectly correlated data
the frequencies of each array are concentrated in a single class
or subgroup of the array. According to the shape of the
regression curve two cases can arise for perfectly correlated data,
as the following imaginary distributions illustrate.


In the first distribution it is evident that the mean squared
deviation of the means is obtained by exactly the same numerical
work and by the use of the same numbers as is the standard
deviation of all the y's and hence $\eta$, which is the ratio of these
two measures, must be equal to unity. A study of the second
distribution leads to the conclusion that in this case also the two
measures of deviation are equal and hence $\eta$ is unity as it is in
the first distribution.
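
The limiting values can be verified on small imaginary tables. The three tables below are hypothetical stand-ins (a straight diagonal, a bent arrangement, and a table whose columns are all proportional) and are not the distributions printed in the book; the function is a minimal sketch using class numbers as values.

```python
from math import sqrt

def correlation_ratio(table):
    """Correlation ratio of the row attribute on the column attribute."""
    values = range(1, len(table) + 1)
    row_totals = [sum(row) for row in table]
    total = sum(row_totals)
    mean_all = sum(v * f for v, f in zip(values, row_totals)) / total
    variance_all = sum(
        f * (v - mean_all) ** 2 for v, f in zip(values, row_totals)
    ) / total
    between = 0.0
    for column in zip(*table):
        n = sum(column)
        if n:
            mean_array = sum(v * f for v, f in zip(values, column)) / n
            between += n * (mean_array - mean_all) ** 2
    return sqrt(between / total / variance_all)

# Each column concentrated in one subgroup, along a straight line.
straight = [[5, 0, 0],
            [0, 8, 0],
            [0, 0, 4]]
# Each column concentrated in one subgroup, along a bent curve.
bent = [[0, 6, 0],
        [0, 0, 0],
        [4, 0, 5]]
# Every column has the same relative distribution: no correlation.
uncorrelated = [[1, 2, 3],
                [2, 4, 6],
                [1, 2, 3]]

for name, tbl in [("straight", straight), ("bent", bent), ("uncorrelated", uncorrelated)]:
    print(name, round(correlation_ratio(tbl), 6))
```

Both concentrated tables give a ratio of unity whatever the shape of the curve of means, and the proportional table gives zero.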

Exercises. 

9.  Compare the values of $\eta$ that have been computed with the
general appearance of correlation in the tables.

10.  Can a tendency be detected for the two values of $\eta$ to be
closer together in value for highly correlated data than for data of
smaller correlation?

11.  Describe in general terms the two species of perfect
correlation illustrated above.

Probable Deviation of the Correlation Ratio. It can be
proved that the probable deviation of a correlation ratio is given
by the formula

P. E. $\eta = 0.6745 \, \dfrac{1 - \eta^2}{\sqrt{N}}$
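
A minimal numerical sketch, assuming the probable deviation has the customary form 0.6745 (1 - η²)/√N with N the total frequency. Both that form and the function name are assumptions made for illustration; the inputs are the values η = 0.56 and N = 750 of Table X.

```python
from math import sqrt

def probable_error_eta(eta, total_frequency):
    """Probable error of a correlation ratio, assuming the form
    0.6745 * (1 - eta^2) / sqrt(N)."""
    return 0.6745 * (1.0 - eta ** 2) / sqrt(total_frequency)

pe = probable_error_eta(0.56, 750)
print(round(pe, 4))
```

Under this assumption the probable deviation is small beside the ratio itself, as would be expected with so large a total frequency.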


Exercises. 

12.  Compute  the  probable  deviations  of  the  correlation  ratios  of 
this  chapter. 

The probable error is of much practical use in estimating
the significance of values of $\eta$ tho of course the facts of its
theoretical derivation must always be kept in mind. It is
assumed in the derivation of the formula that the data is
strictly homogeneous thruout; that all the fluctuations from
sample to sample are merely those of random sampling; that
the regressions are truly linear; and that each array has the
same spread.

In working with correlations, especially where the total
frequencies are not large, it is always well to obtain a
considerable number of distributions. Then if there proves to be
a consistency in the value of $\eta$ greater confidence can be placed
in those values than if there was only one distribution. Thus
if fifty groups of 750 students were each measured for height
and weight and the computed values for $\eta$ should show a
decided tendency to agree in value, increased weight could be given
to these values of $\eta$.

Spurious Correlation. In interpreting the computed value
of any measure of correlation care must be taken that the
correlation is not merely apparent and due to the method of
obtaining the data. Wages and general prices appear to be highly
correlated, but how much of this apparent connection is due to
the fact that both are expressed in terms of money which has
cheapened and consequently, at least to some extent, caused
both wages and general prices to increase? When money is
becoming cheaper both wages and prices tend in general to rise
together; when money is becoming dearer both wages and prices
tend in general to fall together. Such variations in the
purchasing value of money could thus introduce a considerable
element of apparent connection between the two attributes,
wages and general prices, even when there was no tendency
whatever for real wages to change, that is, no change in the
amount of goods that the laborer could purchase with his wages.

Exercises. 

13.  In which of the correlations of this chapter is there a
possibility of spurious correlation?

14.  Show that in correlating index numbers especial care is
necessary in interpreting values of $\eta$.

15.  Show   that   where   there   is  an   element  of   spurious  correlation 
present  the  correlation  is  real  in  so  far  as  the  measurements  themselves 
are  concerned. 


CHAPTER  IX. 
THE  COEFFICIENT  OF  CORRELATION. 

Linear Regression. A straight line fitted to the means
of the arrays is called a line of regression. A line of
regression smooths the curve of regression. Whenever a curve of
means approximates a straight line the regression is said to
be sensibly linear. If the regression curve, within the limits
of accuracy of the data, is exactly a straight line the
regression is said to be truly linear.

The slope of a regression line shows the broad general
tendencies of the connection between the attributes. Does
weight tend to increase as height increases? Does the monthly
precipitation increase with an increase of temperature? If so
at about what rate? These are questions that depend for an
answer on the slopes of the regression curves. It may happen
that in some correlation tables the regression curves deviate
in form so widely from a straight line that the line of regression
has but little significance; in such cases the usefulness of any
statement of general tendencies is open to question.

Exercises. 

1.  Draw  by  inspection  the  regression  lines  on  the  correlation  table 
of  student  heights  and  weights. 

2.  In   Exercise   1   estimate  the   comparative  degrees  of   correlation 
shown  by  the  two  regression  lines. 

The Equations of the Lines of Regression. Let the
coordinate axes be the two lines thru the center determined by
the means of all the variates as described on page 74. Then
$\bar{y}_x$ and x are the coordinates of a point on the one regression
line and $\bar{x}_y$ and y on the other. It must be understood, however,
that the values of $\bar{y}_x$ and $\bar{x}_y$ here referred to are the adjusted or
fitted means of the arrays so that unless the regressions are truly
linear these values will differ from the values obtained by actual
computation.

It is demonstrated in Chapter XII that the equation of the
regression line of the means of the y arrays is

$\bar{y}_x = r \dfrac{\sigma_y}{\sigma_x} x$.

And of the means of the x arrays,

$\bar{x}_y = r \dfrac{\sigma_x}{\sigma_y} y$,

where $\sigma_y$ is the right hand marginal standard deviation and $\sigma_x$
the bottom marginal standard deviation. The constant r is
defined by the equation

$r = \dfrac{\sum \sum n_{xy} \cdot x \cdot y}{N \sigma_x \sigma_y}$.

The expression $\sum \sum n_{xy} \cdot x \cdot y$ is a symbolic way of saying: the sum
obtained by multiplying the frequency of each class by its deviation
from the horizontal axis and then by its deviation from the vertical
axis and then adding the products.

According to the first of the two regression equations the
mean weight for height 71 is obtained by substituting the value
for x measured from the mean and multiplying and dividing
as the formula directs. We found that for this data of
student measurements $\sigma_y = 2.36$ and $\sigma_x = 3.14$. The value of r
is found presently to be 0.56. The array is distant from the
mean 3.1. Hence,

$\bar{y}_x = 0.56 \times \dfrac{2.36}{3.14} \times 3.1 = 1.31$ weight classes from the mean weight.
This  use  of  the  regression  lines  to  estimate  the  position  of  the 
means  is  often  of  practical  value. 
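
The estimate just made amounts to a single multiplication and division, sketched below in Python. The function name is illustrative; the figures are those quoted in the text (r = 0.56, standard deviations 2.36 and 3.14, and an array 3.1 classes from the mean).

```python
def regression_mean(r, sigma_y, sigma_x, x_deviation):
    """Fitted mean of the y array lying x_deviation classes from the mean of
    the x's, measured in y classes from the mean of all the y's."""
    return r * sigma_y / sigma_x * x_deviation

estimate = regression_mean(r=0.56, sigma_y=2.36, sigma_x=3.14, x_deviation=3.1)
print(round(estimate, 2))
```

Carried out without intermediate rounding the product is close to 1.30, which agrees with the 1.31 of the text to within the rounding of the hand computation.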

The Coefficient of Correlation. Let us now compute the
correlation ratio using, however, in case the regression is not
truly linear, not the actual means of the array but the means
given by the regression line. The deviation of a mean from
the horizontal axis is $r \dfrac{\sigma_y}{\sigma_x} x$. The square of this quantity
multiplied by the frequency of the array is
$n_x x^2 \cdot r^2 \dfrac{\sigma_y^2}{\sigma_x^2}$.
The last factor is the same for each array and the sum of the
other factors leads to the standard deviation of all the x's.
Hence we have, on carrying out the multiplications for each
array and adding,

$r^2 \dfrac{\sigma_y^2}{\sigma_x^2} \sum n_x x^2 = r^2 \dfrac{\sigma_y^2}{\sigma_x^2} \cdot N \sigma_x^2 = N r^2 \sigma_y^2$.

Therefore the mean squared deviation of the regression means of the
arrays is $\dfrac{N r^2 \sigma_y^2}{N} = r^2 \sigma_y^2$. On dividing the square root of this
mean squared deviation by the standard deviation of all the y's we have

$\dfrac{\sqrt{r^2 \sigma_y^2}}{\sigma_y} = r$.

That is, the constant r turns out to be the
correlation ratio when the regression means are used instead
of the true means. It is called the Coefficient of Correlation.

Computation of r. For computation purposes the summation
$\sum\sum n_{xy}\, x\, y$ can be arranged in the following manner. Let
the subgroup frequencies of a given y array be each multiplied
by the respective deviations, all deviations being measured from
the axis thru the center, and the products summed. Divided
by the frequency of the array this sum gives the mean $\bar{y}_x$.
Hence the summation for the array is equal to the product of
the mean $\bar{y}_x$ and the frequency $n_x$. On making this
substitution the original summation formula becomes
$\sum n_x\, \bar{y}_x\, x$, or $\sum n_x (\bar{y}_x - \bar{y})(x - \bar{x})$
from the original axes.

In the course of the computation of the correlation ratio
the means $\bar{y}_x$ are obtained and hence to the computation
schedule of page 77 only the additional column for the x
deviations of each array is needed. Then the multiplication of
the corresponding values from the $n_x$, $(\bar{y}_x - \bar{y})$, and
$(x - \bar{x})$ columns gives the column which sums into the quantity
$\sum n_x (\bar{y}_x - \bar{y})(x - \bar{x})$. This sum divided by the
product of the three factors N, $\sigma_x$, and $\sigma_y$ gives the
required value for r.
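
The equivalence of the two arrangements can be seen on a small
example. The sketch below uses a hypothetical correlation table
(the frequencies and the names `table`, `x_vals`, `y_vals` are
invented for illustration and are not the student data); it forms
the double summation directly and again by arrays, and then
divides by N, sigma_x and sigma_y.

```python
# Hypothetical correlation table: table[i][j] is the frequency of
# the class with y = y_vals[i] and x = x_vals[j].
x_vals = [1, 2, 3, 4]
y_vals = [1, 2, 3]
table = [
    [4, 2, 1, 0],
    [2, 5, 4, 1],
    [0, 2, 3, 6],
]

rows, cols = range(len(y_vals)), range(len(x_vals))
N = sum(sum(row) for row in table)
n_x = [sum(table[i][j] for i in rows) for j in cols]   # array frequencies
n_y = [sum(row) for row in table]

x_bar = sum(n * x for n, x in zip(n_x, x_vals)) / N
y_bar = sum(n * y for n, y in zip(n_y, y_vals)) / N
sigma_x = (sum(n * (x - x_bar) ** 2 for n, x in zip(n_x, x_vals)) / N) ** 0.5
sigma_y = (sum(n * (y - y_bar) ** 2 for n, y in zip(n_y, y_vals)) / N) ** 0.5

# Direct double summation, deviations measured from the means.
direct = sum(
    table[i][j] * (x_vals[j] - x_bar) * (y_vals[i] - y_bar)
    for i in rows
    for j in cols
)

# Arrangement by arrays: the mean of each array, then
# n_x * (mean of array - general mean) * (x - mean of x).
array_means = [sum(table[i][j] * y_vals[i] for i in rows) / n_x[j] for j in cols]
by_arrays = sum(
    n_x[j] * (array_means[j] - y_bar) * (x_vals[j] - x_bar) for j in cols
)

r = by_arrays / (N * sigma_x * sigma_y)
print(direct, by_arrays, r)
```

The two sums agree apart from rounding in the last decimal
places, so that only the array means and the array frequencies
are needed once the schedule for the correlation ratio has been
made out.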

Table XI which follows shows the computation. The
value of $\sigma_y$ is found in Exercise 10, Chapter V, to be 2.31,
and $\sigma_x = 3.13$.


Computation of r.

  x    $\bar{y}_x$   $\bar{y}_x - \bar{y}$   $n_x$   $x - \bar{x}$   $n_x(x - \bar{x})$   $n_x(x - \bar{x})(\bar{y}_x - \bar{y})$

  1      9.5      -1.6        2      -6.9      -13.8       22.08
  2      4.7      -3.2       10      -5.9      -59.0      188.8
  3      4.4      -3.5       11      -4.9      -53.9      188.65
  4      4.6      -3.1       38      -3.9     -148.2      459.42
  5      6.1      -1.8       57      -2.9     -165.3      297.54
  6      6.8      -1.1       93      -1.9     -176.7      194.37
  7      6.9      -1.0      106      -0.9      -95.4       95.4
  8      8.0      +0.1      126      +0.1      +12.6        1.26
  9      8.8      +0.9      109      +1.1     +119.9      107.91
 10      9.7      +1.8       87      +2.1     +182.7      328.86
 11     10.9      +3.0       75      +3.1     +232.5      697.5
 12     10.3      +2.4       23      +4.1      +94.3      226.32
 13     11.1      +3.2        9      +5.1      +45.9      146.88
 14     10.5      +2.6        4      +6.1      +24.4       63.44

$$\sum n_x (x - \bar{x})(\bar{y}_x - \bar{y}) = 3018.4$$

$$r = \frac{\sum n_x (x - \bar{x})(\bar{y}_x - \bar{y})}{N \sigma_x \sigma_y} = 0.55$$

TABLE XI.

Exercises.

3.  Compute the value of r for the monthly precipitation and
temperature data.

4.  Compute r for the top-hog and top-beef-cattle data.

5.  Compare the values of r in Exercises 3 and 4 and in the
weight-height data with the corresponding values for $\eta$.

6.  Compute the value of r from the monthly price-of-hogs data
of Exercise 12, Chapter VII. Compare with the corresponding value
for $\eta$.

7.  Does there seem to be a tendency for $\eta$ and r to agree more
closely for highly correlated data than for material of small
correlation?

8.  Compare the amount of labor involved in the computation of r
with that involved in the computation of $\eta$.

The Relation of r to $\eta$. In data exhibiting a regression
which is truly linear the value of $\eta$ is, of course, identical
with that of r. In the case of any but truly linear regression it
is readily shown in Chapter XII that the value of r is necessarily
less than that of $\eta$. In fact if the regression curve is of a
certain shape the value of r will be very small even tho
practically perfect correlation exists.

Unlike the correlation ratio the coefficient of correlation
expresses a property of the correlation table as a whole and not
merely of one or the other of the two correlations of the table.

Again, unlike the correlation ratio, the negative sign
obtained in the final extraction of the square root (in the
discussion of page 76) has a significance; it indicates that the
regression line has a negative slope and hence that the
connection between the attributes is inverse; that is, one
attribute increases while the other decreases.

Because both positive and negative values of r can occur
there is no tendency, as there is in the case of $\eta$, for small
values of r to be larger than the actual degree of correlation
would warrant.

Limiting Values for r. In data of zero correlation it is
clear that the regression line coincides with the axis and hence
the value of r must be zero.

Reasoning from the relation of r to $\eta$ we see that for truly
linear regression perfect correlation leads to a value of r equal
to unity. The unity value for r will be positive or negative
according as the correlation is direct or inverse. According to
the underlying theory of the coefficient of correlation, for data
in which a regression is not linear the value of r cannot be unity
even tho there is perfect correlation and hence r is necessarily
smaller in value than the degree of correlation would require.

Statistical  Properties  of  the  Coefficient  of  Correlation. 
The  coefficient  r  is,  as  the  preceding  discussions  show,  a  con- 
servative measure  of  correlation.  In  periodic  data  exhibiting  a 
sinusoid  form  for  the  regression  curve  the  correlation  may  be 
high  but  because  the  departure  of  the  regression  from  linearity 
is  so  wide  the  value  of  r  understates  the  correlation  and  hence 
its  applicability  in  such  data  is  not  of  importance. 

The  characteristic  importance  of  the  coefficient  r  is  in  de- 
fining the  slope  of  the  regression  lines.  It  furnishes  the  most 
convenient  method  for  defining  the  general  tendencies  in  the 
data.  The  rise  of  prices,  for  instance,  during  the  last  fifteen 
years  can  be  readily  measured  by  the  rate  of  rise  of  the  regres- 
sion line. 



Therefore  for  the  single  purpose  of  measuring  correlation 
the  coefficient  of  correlation  is  distinctly  inferior  to  the  corre- 
lation ratio  both  in  convenience  and  reliability.  It  should  never 
be  used  as  a  measure  of  correlation  without  first  carefully  test- 
ing the  form  of  the  regression.  It  does  have  however  the  highly 
useful  property  of  giving  the  slopes  of  the  regression  lines. 

Test for Linearity of Regression. It would be suspected
from the preceding theory and discussion that the difference
between $\eta$ and r should be an indicator of the departure of the
regression from linearity. A somewhat more convenient measure
of this departure than the simple difference is the difference
of the squares of $\eta$ and r.

Probable Deviations. The following probable deviations
can be derived.

$$\text{P. E. of } r = 0.6745\, \frac{1 - r^2}{\sqrt{N}}$$

$$\text{P. E. of } (\eta^2 - r^2) = \frac{1.35}{\sqrt{N}} \sqrt{\eta^2 - r^2}$$

A practical criterion of linearity is to assume linearity when

$$\frac{\sqrt{N}}{1.35} \sqrt{\eta^2 - r^2} < 2.5.$$
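
The criterion is easily applied by machine. The following Python
sketch takes the test to be sqrt(N)/1.35 * sqrt(eta^2 - r^2) < 2.5
and the probable deviation of r to be 0.6745 (1 - r^2)/sqrt(N);
the function names and the trial values of eta, r and N are
assumptions made only for illustration.

```python
import math


def probable_error_of_r(r, n):
    """Probable deviation of r, taken as 0.6745 * (1 - r^2) / sqrt(N)."""
    return 0.6745 * (1 - r ** 2) / math.sqrt(n)


def linearity_statistic(eta, r, n):
    """sqrt(N) / 1.35 * sqrt(eta^2 - r^2); small values favor linearity."""
    departure = max(eta ** 2 - r ** 2, 0.0)  # guard against rounding below zero
    return math.sqrt(n) / 1.35 * math.sqrt(departure)


def regression_is_linear(eta, r, n, limit=2.5):
    """Apply the practical criterion: assume linearity below the limit."""
    return linearity_statistic(eta, r, n) < limit


# Hypothetical trial values, not taken from the tables of the text.
print(probable_error_of_r(0.55, 750))
print(linearity_statistic(0.58, 0.55, 750), regression_is_linear(0.58, 0.55, 750))
print(linearity_statistic(0.56, 0.55, 100), regression_is_linear(0.56, 0.55, 100))
```

With a large N even a small difference between eta and r may
exceed the limit, which is the reason the criterion is stated in
terms of N and not of the difference alone.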

9.  Compute the regression equations for each of the correlation
tables of Chapter VII.

10.  How can the value of r be obtained graphically from the
regression lines? Is this a practicable method of finding the
value of r?

11.  Compute the measure of departure from linearity,
$(\eta^2 - r^2)$, for the correlation tables of Chapter VII.

12.  A correlation table has two measures of departure from
linearity. Show that one regression may be linear and the other
non-linear.

13.  Show that if the value of r is high the regressions must both
be approximately linear.

14.  By extending the regression line estimate the price of live
hogs for January, 1917.

15.  What weight should correspond to a student height of 50
inches?

16.  What is the best estimate of the temperature for a month
with a precipitation of 3.4 inches?

17.  Discuss the value of the probable deviations in exercises 14-16.


CHAPTER  X. 

CORRELATION  FROM  RANKS. 

Rank in a Series. When the data consists not of the
direct measurements of the characteristics but of their order or
rank in a series the correlation of the ranks may differ
materially from the true variate correlation. Let us define rank
as position in a series so that an individual of rank one would
have no individuals above or before it; an individual of rank
two would have one individual before it, etc.

To pass from rank to variate correlation it is necessary to
know the form of the distribution of the values of the
characteristics. Only for normal distributions has the requisite
theory been developed. It is consequently necessary to employ the
same formulas for other forms of distributions, although this
may sometimes open the way to serious inaccuracies.

Let the ranks of the same individual in regard to the
respective characteristic be $\nu_x$ and $\nu_y$. Let there be N
individuals and let $\bar{\nu}_x$ and $\bar{\nu}_y$ denote the respective
means of the two series and $\sigma_{\nu_x}$ and $\sigma_{\nu_y}$ the
standard deviations.

Also let all the measurements of each characteristic be
distinct in value; that is, let there be no equal measurements.

Theorem I. The mean ranks $\bar{\nu}_x$ and $\bar{\nu}_y$ are each
equal to $(N + 1)/2$.

Since there are as many ranks as individual measurements
and since the ranks proceed uniformly from 1 to N the mean
is $(N + 1)/2$.

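Theorem II. The standard deviations of the ranks are
each equal to $\sqrt{\frac{N^2 - 1}{12}}$.

For, $N\sigma_{\nu_x}^2 = \sum \nu_x^2 - N\bar{\nu}_x^2$, and on applying the
rules $\sum \nu^2 = \frac{1}{6}N(N + 1)(2N + 1)$ and
$\bar{\nu} = \frac{N + 1}{2}$,

$$N\sigma_{\nu_x}^2 = \frac{1}{6}N(N + 1)(2N + 1) - \frac{N(N + 1)^2}{4} = \frac{1}{12}(N^3 - N).$$

Therefore $\sigma_{\nu_x}^2 = \frac{1}{12}(N^2 - 1)$.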
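
Theorems I and II may be verified numerically for any number of
individuals. The short Python sketch below (the function name is
chosen here only for illustration) forms the ranks 1 to N and
compares their mean and mean squared deviation with (N + 1)/2 and
(N^2 - 1)/12.

```python
def rank_mean_and_variance(n):
    """Mean and mean squared deviation of the untied ranks 1, 2, ..., n."""
    ranks = range(1, n + 1)
    mean = sum(ranks) / n
    variance = sum((v - mean) ** 2 for v in ranks) / n
    return mean, variance


for n in (5, 10, 24):
    mean, variance = rank_mean_and_variance(n)
    # Theorem I: mean rank; Theorem II: squared standard deviation.
    print(n, mean, (n + 1) / 2, variance, (n ** 2 - 1) / 12)
```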

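The following theorem is necessary for the computation
of rank correlation.

Theorem III. If $\sigma_x = \sigma_y$,
$r = 1 - \frac{\sigma_{(x-y)}^2}{2\sigma_x^2}$.

For, $(x - y)^2 = x^2 + y^2 - 2xy$,

and $\sum (x - y)^2 = \sum x^2 + \sum y^2 - 2\sum xy$,

or $N\sigma_{(x-y)}^2 = N\sigma_x^2 + N\sigma_y^2 - 2\sum xy$.

But $\sum xy = r \cdot N \cdot \sigma_x \sigma_y$.

Therefore $N\sigma_{(x-y)}^2 = N\sigma_x^2 + N\sigma_y^2 - 2Nr\sigma_x\sigma_y$,

and $r = \frac{\sigma_x^2 + \sigma_y^2 - \sigma_{(x-y)}^2}{2\sigma_x\sigma_y}$.

If $\sigma_x = \sigma_y$, $r = 1 - \frac{\sigma_{(x-y)}^2}{2\sigma_x^2}$.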
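
The general relation of the proof, r = (sigma_x^2 + sigma_y^2 -
sigma_(x-y)^2) / (2 sigma_x sigma_y), can be checked against the
ordinary product formula. The Python sketch below uses two short
hypothetical series (`xs` and `ys` are invented for illustration)
and computes r both ways.

```python
def mean(values):
    return sum(values) / len(values)


def variance(values):
    """Mean squared deviation from the mean (divisor N)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def product_moment_r(xs, ys):
    """r as the mean product of deviations over the two standard deviations."""
    mx, my = mean(xs), mean(ys)
    cross = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cross / (variance(xs) ** 0.5 * variance(ys) ** 0.5)


# Hypothetical measurements, used only to illustrate the identity.
xs = [2.0, 4.0, 5.0, 7.0, 9.0]
ys = [1.0, 3.0, 2.0, 6.0, 5.0]
diffs = [x - y for x, y in zip(xs, ys)]

r_direct = product_moment_r(xs, ys)
r_from_differences = (variance(xs) + variance(ys) - variance(diffs)) / (
    2 * variance(xs) ** 0.5 * variance(ys) ** 0.5
)
print(r_direct, r_from_differences)
```

The second form requires only the three standard deviations, so
that no products of deviations need be formed.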


Exercises.

1.  Show how to compute the value of r from the data of Table
VIII by the formula
$r = \frac{\sigma_x^2 + \sigma_y^2 - \sigma_{(x-y)}^2}{2\sigma_x\sigma_y}$.

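Theorem IV. The correlation coefficient of the ranks $\nu_x$
and $\nu_y$ is given by the formula,

$$r_{\nu_x \nu_y} = 1 - \frac{6\sum(\nu_x - \nu_y)^2}{N(N^2 - 1)}.$$

On making use of Theorem III, we have,

$$r_{\nu_x \nu_y} = 1 - \frac{\sigma_{(\nu_x - \nu_y)}^2}{2\sigma_{\nu_x}^2}
= 1 - \frac{\frac{1}{N}\sum(\nu_x - \nu_y)^2}{2 \cdot \frac{1}{12}(N^2 - 1)}, \text{ from Theorem II,}$$

$$= 1 - \frac{6\sum(\nu_x - \nu_y)^2}{N(N^2 - 1)}.$$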

To illustrate the method let us compute the rank correlation
between yearly mean temperature and yearly mean rainfall for Ohio
as shown by the data of Exercises 2 and 3 of Chapter II. The
yearly means are obtained by adding the monthly means and
dividing by twelve.

The order of the twenty-four years in respect to temperature
is written in the first column and in respect to rainfall in the
second. The ties are disposed of by assigning the ranks in the
inverse order of the time; thus 1903 and 1902 are both at 50.5,
and 1903 is given rank 15 and 1902 rank 16. But the matter of
ties will be presently considered. The third column contains
for each year the differences in rank with respect to the two
attributes, temperature and rainfall, and the fourth the squared
differences. On adding the fourth column and applying the
formula $r = 1 - \frac{6\sum(\nu_x - \nu_y)^2}{N(N^2 - 1)}$ we find
r = 0.10.

Year.    Temp.    Rainfall.    Difference.    Sq. Diff.

1911       1          4             3              9
1910      17          6            11            121
1909      13          5             8             64
1908       5         19            14            196
1907      22          3            19            361
1906       9         14             5             25
1905      20         10            10            100
1904      24         17             7             49
1903      15         16             1              1
1902      16         13             3              9
1901      18         24             6             36
1900       2         21            19            361
1899      11         18             7             49
1898       6          2             4             16
1897      14         11             3              9
1896       7          9             2              4
1895      21         23             2              4
1894       3         22            19            361
1893      10          7             3              9
1892      19         14             5             25
1891       8         12             4             16
1890       4          1             3              9
1889      11         20             9             81
1888      23          8            15            225

N = 24                      $\sum(\nu_x - \nu_y)^2 = 2,150$

$N(N^2 - 1) = 13,800$        $6\sum(\nu_x - \nu_y)^2 = 12,900$

$$r = 1 - \frac{6\sum(\nu_x - \nu_y)^2}{N(N^2 - 1)} = 0.10.$$
Ties in Rank. The application of the formula
$r_{\nu_x \nu_y} = 1 - \frac{6\sum(\nu_x - \nu_y)^2}{N(N^2 - 1)}$ is
straightforward and direct. The only uncertainty arises from ties
in the measurements. Thus in the preceding illustrative example
it is found that the temperature for each of the two years 1900
and 1894 is 52.3. What ranks are to be assigned to each of the
measurements? In order to avoid complicating details in an
illustrative problem, in the preceding computation we gave the
later year the numerically smaller rank, but ordinarily it is
better to base the ranks on one of the two plans:


(1).  The Bracket Rank method, under which the ties
are assigned the same rank and that equal rank is taken as the
rank next greater than that of the individual preceding the ties.
The next individual after the ties takes the same rank as if
preceding ties had each been given ranks differing by unity.
Thus under this method the ranks of the illustrative example
are as given in the table below.

(2).  The  Mid-Rank  method,  under  which  all  ties  are 
given  the  same  rank  but  that  rank  is  the  rank  of  the  mid- 
individual.  In  the  column  below  the  two  methods  may  be 
compared. 

Under  either  method  the  total  number  of  ranks  must  be 
the  same  and  equal  to  N. 

                           Bracket      Mid-Rank
Year.    Temperature.      Method.      Method.

1911        52.6              1            1
1900        52.3              2            3
1894        52.3              2            3
1890        52.3              2            3
1908        52.1              5            5.5
1898        52.1              5            5.5
1896        51.7              7            7.5
1891        51.7              7            7.5
1906        51.6              9            9.5
1893        51.6              9            9.5
1899        51.5             11           11
1889        51.1             12           12
1909        50.9             13           13
1897        50.6             14           14
1903        50.5             15           15.5
1902        50.5             15           15.5
1910        50.4             17           17
1901        50.2             18           18
1892        50.1             19           19
1905        50.0             20           20
1895        49.9             21           21
1907        49.6             22           22
1888        49.5             23           23
1904        48.6             24           24
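
The two plans can be stated as short procedures. The Python sketch
below is illustrative only (the function names are chosen here);
it ranks from the largest value downward, as in the table, and is
tried on the first seven temperatures of the table.

```python
def bracket_ranks(values):
    """Bracket method: tied values all take the first rank of their group."""
    ordered = sorted(values, reverse=True)      # rank 1 is the largest value
    return [ordered.index(v) + 1 for v in values]


def mid_ranks(values):
    """Mid-rank method: tied values all take the middle rank of their group."""
    ordered = sorted(values, reverse=True)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1            # first rank occupied by the ties
        count = ordered.count(v)                # number of tied individuals
        ranks.append(first + (count - 1) / 2)
    return ranks


temps = [52.6, 52.3, 52.3, 52.3, 52.1, 52.1, 51.7]
print(bracket_ranks(temps))
print(mid_ranks(temps))
```

Under the mid-rank plan the sum of the ranks is the same as if
there were no ties, whereas under the bracket plan it is smaller.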

2.  Compute $r_{\nu_x \nu_y}$ from the above "bracket method" ranks.

3.  Compute $r_{\nu_x \nu_y}$ from the above "mid-rank method" ranks.


Probable Deviation of the Rank Coefficient. As given by
Pearson;

$$\text{P. E. of } r_{\nu_x \nu_y} = \frac{0.6745}{\sqrt{N}} \left(1 - r_{\nu_x \nu_y}^2\right).$$


Perfect Rank Correlation. The ranks are perfectly
correlated, according to the formula, when
$\sum(\nu_x - \nu_y)^2 = 0$; that is, when each individual has the
same rank in both series. Also there is perfect negative
correlation when temperature and rainfall are inversely related so
that the year with the highest temperature is the year with the
lowest rainfall and so on up to the year with the lowest
temperature which is associated with the highest rainfall. In
this case of perfect negative correlation when N is odd,

$$\sum(\nu_x - \nu_y)^2 = 2 \cdot 4\left[0 + 1 + 4 + 9 + \ldots + \left(\frac{N - 1}{2}\right)^2\right]
= 8 \cdot \frac{1}{6} \cdot \frac{N - 1}{2} \cdot \frac{N + 1}{2} \cdot N = \frac{1}{3} N (N^2 - 1).$$

Therefore $r = 1 - \frac{2N(N^2 - 1)}{N(N^2 - 1)} = -1$, a result
according with the usual idea of inversely correlated attributes.
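
The sum for the inverse order can be confirmed by direct addition.
In the Python sketch below (the function name is chosen only for
illustration) the second series is the first reversed, so that an
individual of rank v in one series has rank N + 1 - v in the
other.

```python
def sum_sq_diff_reversed(n):
    """Sum of squared rank differences when one series is the other reversed."""
    return sum((v - (n + 1 - v)) ** 2 for v in range(1, n + 1))


for n in (5, 9, 25):
    total = sum_sq_diff_reversed(n)
    r = 1 - 6 * total / (n * (n ** 2 - 1))
    # total should equal N (N^2 - 1) / 3, which makes r equal to -1.
    print(n, total, n * (n ** 2 - 1) // 3, r)
```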

Uncorrelated Data. According to the formula, when r = 0 the
sum of the squares of the differences of the ranks is equal to
the sum of the squares of the deviations of the ranks of the two
series from their means. Thus when r = 0 subtracting the ranks
has lost its significance, and this is exactly the idea of zero
correlation.

Hence the rank coefficient r is accurately significant for
both perfect and zero correlation.

A Correction Formula for the Rank Coefficient. There
is no assurance however that in general the rank r will exactly
express the true variate correlation. For instance, note the two
following series of deviations,

100, 80, 70, 65, 62, 60, 55, 50, 40, 20; and 100, 99, 98, 97, 96,
95, 10, 9, 8, 7.


The ranks are the same in each series, namely,
1, 2, 3, 4, 5, 6, 7, 8, 9, 10.

The coefficient $r_{\nu_x \nu_y}$, which depends solely on the ranks,
has the same value for a series of which the first is typical as
it does for a series of which the second is typical. And yet the
two distributions are fundamentally distinct in form.

Therefore,  except  for  the  two  extreme  cases  of  material  of 
very  high  and  of  very  low  correlation,  the  value  of  a  correlation 
constant  computed  from  ranks  must  be  interpreted  with  caution. 

For a distribution which is approximately normal in form
the following correction formula for $r_{\nu_x \nu_y}$ has been
derived by Pearson.*

$$r_{xy} = 2 \sin\left(\frac{\pi}{6}\, r_{\nu_x \nu_y}\right)$$

From Table X the values of $r_{xy}$ can be obtained directly
from the value of $r_{\nu_x \nu_y}$ for each 0.05 of $r_{\nu_x \nu_y}$.

Corresponding Values of $r_{xy}$ and $r_{\nu_x \nu_y}$.

$r_{\nu_x \nu_y}$    $r_{xy}$        $r_{\nu_x \nu_y}$    $r_{xy}$

   0.00       0.00           0.55       0.57
   0.05       0.06           0.60       0.62
   0.10       0.10           0.65       0.67
   0.15       0.16           0.70       0.72
   0.20       0.20           0.75       0.77
   0.25       0.26           0.80       0.81
   0.30       0.31           0.85       0.86
   0.35       0.36           0.90       0.91
   0.40       0.42           0.95       0.96
   0.45       0.47           1.00       1.00
   0.50       0.52

TABLE X.

For other values of $r_{\nu_x \nu_y}$ the corresponding values of
$r_{xy}$ are readily obtained by interpolation.
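
Values of this kind can also be computed directly. The Python
sketch below assumes that the correction intended is Pearson's
relation r = 2 sin(pi * rank r / 6) for normal data; the function
names are introduced here for illustration, and the printed
figures may differ in the last place from a table prepared by
hand.

```python
import math


def variate_r_from_rank_r(rank_r):
    """Pearson's relation for normal data: r = 2 * sin(pi * rank_r / 6)."""
    return 2 * math.sin(math.pi * rank_r / 6)


def interpolated_r(rank_r, step=0.05):
    """Linear interpolation between tabulated points spaced `step` apart."""
    k = int(rank_r / step)
    low, high = k * step, (k + 1) * step
    t = (rank_r - low) / step
    return (1 - t) * variate_r_from_rank_r(low) + t * variate_r_from_rank_r(high)


# Tabulate at intervals of 0.05, in the manner of the table.
for k in range(21):
    rank_r = k * 0.05
    print(f"{rank_r:.2f}   {variate_r_from_rank_r(rank_r):.2f}")

print(interpolated_r(0.42), variate_r_from_rank_r(0.42))
```

Because the relation is nearly straight over so short an interval,
interpolation between adjacent tabular values gives the same
result as the formula to the second decimal place.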


*"On    Further    Methods  of   Determining   Correlation"    Drapers    Co. 
Research  Memoirs:     Bio.  Ser.  IV. 


Probable Deviation of $r_{xy}$ Computed from Ranks. As
given by Pearson*

$$\text{P. E. of } r_{xy} \text{ from ranks} = \frac{0.7063}{\sqrt{N}} \left(1 - r_{xy}^2\right).$$


Exercises. 

4.  Determine $r_{xy}$ from the value of $r_{\nu_x \nu_y}$ of Exercise 1.

5.  Compute the value of the rank r from the data of other
exercises and compare with the computed values of the variate r.

Theorem V. Multiplying each rank $\nu_x$ or $\nu_y$ by a constant
does not change the value of r.

For if each rank is multiplied by m, the means $\bar{\nu}_x$ and
$\bar{\nu}_y$ are each multiplied by m so that the squares of the
standard deviations are each multiplied by $m^2$. Also
$\sum(\nu_x - \nu_y)^2$ is multiplied by $m^2$ and hence r is
unchanged.

Theorem VI. Multiplying the ranks of one column but not
of the other in general produces a change in the value of
$r_{\nu_x \nu_y}$.

In this case the formula
$r = 1 - \frac{6\sum(\nu_x - \nu_y)^2}{N(N^2 - 1)}$ cannot be used and
the formula

$$r = \frac{\sum(\nu_x - \bar{\nu}_x)(\nu_y - \bar{\nu}_y)}{N \sigma_{\nu_x} \sigma_{\nu_y}}$$

must be employed.
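
Theorems V and VI may be illustrated numerically. The Python
sketch below reads both theorems as statements about the
difference form of Theorem III, r = 1 - sigma_(x-y)^2 / (2
sigma_x^2); that reading, the function names, and the ranks used
are assumptions made here for illustration.

```python
def variance(values):
    """Mean squared deviation from the mean (divisor N)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def r_from_differences(xs, ys):
    """Difference form of Theorem III; valid only when sigma_x = sigma_y."""
    diffs = [x - y for x, y in zip(xs, ys)]
    return 1 - variance(diffs) / (2 * variance(xs))


def product_moment_r(xs, ys):
    """Product formula, which does not require equal standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cross = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    return cross / (variance(xs) * variance(ys)) ** 0.5


rank_x = [1, 2, 3, 4, 5, 6]       # hypothetical ranks
rank_y = [2, 1, 4, 3, 6, 5]
m = 3

both_scaled = r_from_differences([m * v for v in rank_x], [m * v for v in rank_y])
one_scaled = r_from_differences([m * v for v in rank_x], rank_y)

print(r_from_differences(rank_x, rank_y))                 # original value
print(both_scaled)                                        # Theorem V: unchanged
print(one_scaled)                                         # Theorem VI: changed
print(product_moment_r([m * v for v in rank_x], rank_y))  # product formula
```

When only one column is multiplied the standard deviations are no
longer equal, so the difference form fails while the product
formula still returns the original value.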

The Accuracy of the Coefficient $r_{xy}$ when computed from
Ranks. When the measurements are arranged in ranks and
the coefficient is computed from the ranks alone, the computation
is based on the relatively limited information which the ranks
can convey. Hence the resulting coefficient can not be as
trustworthy and reliable as the moment coefficient. However, when
a detailed correlation table cannot be constructed owing to a
paucity of information, it may still be possible to determine the
rank of the individual. If proper allowance is made for the
necessarily wide inaccuracy of the computed result, the rank
coefficient is better than no coefficient at all for such
inaccurate or indeterminate data.


*  "On  Further  Methods  of  Determining  Correlation".     Drapers  Co. 
Research  Memoirs  :     Bio.  Ser.  IV. 


CHAPTER  XI. 
THE   MOMENTS   OF  A   DISTRIBUTION. 

Introduction.  The  first  moment,  obtained  by  multiplying 
each  deviation  by  the  corresponding  frequency,  adding  the  result- 
ing products  and  dividing  by  the  total  frequency  of  the  distribu- 
tion, was  discussed  in  Chapter  IV  in  connection  with  the 
arithmetic  mean.  The  second  moment,  in  which  the  deviations 
are  squared  before  multiplication  by  the  frequencies,  was  dis- 
cussed in  Chapter  V.  The  third  and  fourth  moments,  with  the 
deviations  cubed  and  raised  to  the  fourth  power  respectively, 
were  also  referred  to  in  Chapter  V. 

Obviously  the  moments  may  be  computed  about  any  point 
by  obtaining  the  deviations  from  that  point  and  raising  to  the 
appropriate  power,  etc.  For  most  purposes,  however,  the  sec- 
ond and  higher  moments  are  computed  about  the  mean  which 
thus  serves  as  a  standard  origin  for  the  moments. 

The moments about the mean are denoted by the symbols
$\mu_1$, $\mu_2$, $\mu_3$, $\mu_4$, etc., where the subscripts refer to
the order of the moments; that is, the index of the power to which
the deviations are raised. Under the same system of notation, the
moments about any other point are denoted by $\mu_1'$, $\mu_2'$,
$\mu_3'$, $\mu_4'$, etc., with the primes serving to distinguish
moments about the mean from moments about any other origin.

The  moments  about  the  mean  may  be  computed  directly  by 
first  computing  the  mean  and  then  subtracting  the  value  of  the 
mean  from  each  deviation  and  using  the  resulting  differences  in 
the  computations  for  the  moments.  This  method  of  computing 
the  moments  has  the  advantages  of  simplicity  and  directness  but 
it  usually  leads  to  troublesome  fractions  and  it  ordinarily  in- 
volves more  labor  than  the  indirect  methods  which  are  described 
in  this  chapter. 

Transformation Formulas for the Moments about the
Mean. The formulas for the moments about the mean in
terms of the moments about a fixed point will now be derived.

Let d be the mean deviation, that is, the distance of the mean
from the fixed point of reference, and let the x's be measured
from the mean. Then corresponding to a given value of x there
will be the deviation $x' = x + d$ about the fixed point.
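From the definition of a moment we have,

$$\mu_1' = \frac{1}{N}\sum (x + d)\,y = \frac{1}{N}\sum xy + \frac{Nd}{N} = \mu_1 + d = d,$$

since $\sum xy$ is zero (Theorem, page 35, Chapter IV);

$$\mu_2' = \frac{1}{N}\sum (x + d)^2 y = \frac{1}{N}\sum x^2 y + \frac{2d}{N}\sum xy + \frac{Nd^2}{N} = \mu_2 + d^2,$$

since $\sum xy = 0$;

$$\mu_3' = \frac{1}{N}\sum (x + d)^3 y = \frac{1}{N}\sum x^3 y + \frac{3d}{N}\sum x^2 y + \frac{3d^2}{N}\sum xy + \frac{Nd^3}{N} = \mu_3 + 3d\mu_2 + d^3;$$

$$\mu_4' = \frac{1}{N}\sum (x + d)^4 y = \frac{1}{N}\sum x^4 y + \frac{4d}{N}\sum x^3 y + \frac{6d^2}{N}\sum x^2 y + \frac{4d^3}{N}\sum xy + \frac{Nd^4}{N} = \mu_4 + 4d\mu_3 + 6d^2\mu_2 + d^4.$$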

Transposing a part of the terms in the four preceding
equations and changing the signs, we have the following equations
which express each moment about the mean in terms of the
corresponding moment about the fixed point and the moments of
lower order about the mean:

$$\mu_1 = \mu_1' - d = 0, \text{ since } \mu_1' = d;$$

$$\mu_2 = \mu_2' - d^2;$$

$$\mu_3 = \mu_3' - 3d\mu_2 - d^3;$$

$$\mu_4 = \mu_4' - 4d\mu_3 - 6d^2\mu_2 - d^4.$$

These formulas for transferring the moments from a fixed
point to the mean are arranged in what is called the continuous
form; that is, they begin with the moment of lowest order and
proceed step by step up to the fourth moment.
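
The continuous form lends itself to step-by-step computation. The
Python sketch below is illustrative only: the frequency
distribution is hypothetical and the function names are chosen
here. It computes the four moments about the point x = 0,
transfers them to the mean by the relations mu_2 = mu_2' - d^2,
mu_3 = mu_3' - 3 d mu_2 - d^3, mu_4 = mu_4' - 4 d mu_3 -
6 d^2 mu_2 - d^4, and compares with the moments computed directly
about the mean.

```python
def moments_about(point, xs, freqs, max_order=4):
    """Moments of orders 1..max_order about `point` for a frequency table."""
    n = sum(freqs)
    return [
        sum(f * (x - point) ** k for x, f in zip(xs, freqs)) / n
        for k in range(1, max_order + 1)
    ]


def transfer_to_mean(m1p, m2p, m3p, m4p):
    """Continuous form: each step uses the moment of lower order just found."""
    d = m1p                      # distance of the mean from the fixed point
    mu2 = m2p - d ** 2
    mu3 = m3p - 3 * d * mu2 - d ** 3
    mu4 = m4p - 4 * d * mu3 - 6 * d ** 2 * mu2 - d ** 4
    return 0.0, mu2, mu3, mu4


# Hypothetical distribution of five classes.
xs = [1, 2, 3, 4, 5]
freqs = [2, 5, 9, 6, 3]

about_zero = moments_about(0, xs, freqs)
transferred = transfer_to_mean(*about_zero)
direct = moments_about(about_zero[0], xs, freqs)   # about the mean itself

print(transferred)
print(direct)
```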


Exercises.

1.  Compute the third and fourth moments for the student height
data at the beginning of Chapter III.

2.  By taking the fixed point of reference at various points show
that for the data of Student Heights the third and fourth moments
are least when computed about the arithmetic mean.

3.  Show algebraically that any moment is least when taken about
the mean.

4.  Show that having the formulas arranged in the continuous form
does not involve additional labor of computation.

5.  By expanding $\mu_n = \frac{1}{N}\sum (x - d)^n$ according to the
binomial theorem where x is measured from the fixed point of
reference, derive the general formula expressing $\mu_n$ in terms of
the moments about the fixed point.

6.  Specialize the formula of (5) for the third and fourth moments.

7.  Compute the third and fourth moments from the data of Table I
by using the formulas of exercise 6.

8.  Find the first, second, third and fourth moments about the
mean of a distribution with frequencies proportional to the
successive terms of the expansion of the binomial $(p + q)^n$.

Ans. $\mu_3 = npq(p - q)$; $\mu_4 = npq(3(n - 2)pq + 1)$. (See Hardy,
"Construction of Tables of Mortality," p. 107 et seq.).

The  computation  of  the  moments  about  the  mean  either 
directly  or  by  first  computing  about  a  convenient  origin  and 
then  transforming  to  the  mean  is  open  to  the  serious  practical 
objection  that  there  are  no  convenient  methods  of  checking  the 
results.  The  arithmetic  of  the  following  summation  methods  is 
comparatively  brief  and  admits  of  satisfactory  checks  on  the  cor- 
rectness of  the  results. 

The First Summation Method of Computing the Moments.
The theory of this and the second summation method
which follows immediately after it are somewhat detailed but
both are entirely elementary throughout.

Let us take a distribution with the five frequencies $y_1$, $y_2$,
$y_3$, $y_4$, $y_5$ corresponding to values of x equal to 1, 2, 3, 4,
5. By the ordinary direct method, the first moment about the point
x = 0 is $y_1 + 2y_2 + 3y_3 + 4y_4 + 5y_5$, divided by N. Now let us
arrange the y's in vertical order and add in the manner indicated
in the second column below.

$$\begin{array}{lll}
(1) & (2) & (3) \\
y_1 & y_1 + y_2 + y_3 + y_4 + y_5 & y_1 + 2y_2 + 3y_3 + 4y_4 + 5y_5 \\
y_2 & y_2 + y_3 + y_4 + y_5 & y_2 + 2y_3 + 3y_4 + 4y_5 \\
y_3 & y_3 + y_4 + y_5 & y_3 + 2y_4 + 3y_5 \\
y_4 & y_4 + y_5 & y_4 + 2y_5 \\
y_5 & y_5 & y_5 \\
\hline
 & y_1 + 2y_2 + 3y_3 + 4y_4 + 5y_5 & y_1 + 3y_2 + 6y_3 + 10y_4 + 15y_5
\end{array}$$

$$\begin{array}{ll}
(4) & (5) \\
y_1 + 3y_2 + 6y_3 + 10y_4 + 15y_5 & y_1 + 4y_2 + 10y_3 + 20y_4 + 35y_5 \\
y_2 + 3y_3 + 6y_4 + 10y_5 & y_2 + 4y_3 + 10y_4 + 20y_5 \\
y_3 + 3y_4 + 6y_5 & y_3 + 4y_4 + 10y_5 \\
y_4 + 3y_5 & y_4 + 4y_5 \\
y_5 & y_5 \\
\hline
y_1 + 4y_2 + 10y_3 + 20y_4 + 35y_5 & y_1 + 5y_2 + 15y_3 + 35y_4 + 70y_5
\end{array}$$

The sum of the second column, divided by N, is thus the same as
the first moment. By the direct method the second moment about
the same point is $y_1 + 4y_2 + 9y_3 + 16y_4 + 25y_5$ divided by N.
Let us designate the sum of column (2), when divided by N, by
$S_2$; the sum of column (3), divided by N, by $S_3$; the sum of
column (4), when divided by N, by $S_4$, etc.

That is,

$$S_2 = \frac{y_1 + 2y_2 + 3y_3 + \ldots}{N}, \qquad S_3 = \frac{y_1 + 3y_2 + 6y_3 + \ldots}{N},$$

$$S_4 = \frac{y_1 + 4y_2 + 10y_3 + \ldots}{N}, \qquad S_5 = \frac{y_1 + 5y_2 + 15y_3 + \ldots}{N}.$$

It is apparent on inspection that $2S_3 - S_2$ is the second moment.
In symbols,

$$2S_3 - S_2 = \frac{2}{N}(y_1 + 3y_2 + 6y_3 + 10y_4 + 15y_5) - \frac{1}{N}(y_1 + 2y_2 + 3y_3 + 4y_4 + 5y_5)
= \frac{1}{N}(y_1 + 4y_2 + 9y_3 + 16y_4 + 25y_5).$$

That is, $\mu_2' = 2S_3 - S_2$.


The third moment about the same point of reference is

$$\frac{1}{N}(y_1 + 8y_2 + 27y_3 + 64y_4 + 125y_5).$$

For this moment the following relation is readily verified:

$$\mu_3' = 6S_4 - 6S_3 + S_2.$$

Extending the reasoning to the case of the fourth moment,
we have

$$\mu_4' = 24S_5 - 36S_4 + 14S_3 - S_2.$$

We thus have four relations connecting the moments with
the S's:

$$\mu_1' = S_2,$$

$$\mu_2' = 2S_3 - S_2,$$

$$\mu_3' = 6S_4 - 6S_3 + S_2,$$

$$\mu_4' = 24S_5 - 36S_4 + 14S_3 - S_2.$$


Transferred to the mean as origin by the formulas of page
88 these moments become

$$S_2 = d;$$

$$\mu_2 = \mu_2' - d^2 = 2S_3 - S_2 - d^2 = 2S_3 - d(1 + d);$$

$$\mu_3 = \mu_3' - 3d\mu_2 - d^3 = 6S_4 - 6S_3 + S_2 - 3d\mu_2 - d^3$$

$$= 6S_4 - 3\mu_2 - 3d(1 + d) + d - 3d\mu_2 - d^3 = 6S_4 - 3\mu_2(1 + d) - d(1 + d)(2 + d);$$

and similarly,

$$\mu_4 = 24S_5 - 2\mu_3\{2(1 + d) + 1\} - \mu_2\{6(1 + d)(2 + d) - 1\} - d(1 + d)(2 + d)(3 + d).$$

It  is  evident  that  the  same  relations  hold  for  a  larger  num- 
ber of  classes  than  the  five  which  we  have  assumed  for  the  pur- 
pose of  illustrating  the  method. 

These  relations  connecting  the  moments  about  the  mean 
with  the  sums  obtained  by  this  process  of  summation  are  ma- 
terially shorter  and  more  convenient  than  the  direct  formulas. 
It  will  be  noticed  that  the  sum  of  any  column  is  the  largest 



number  in  the  next  column,  so  that  a  satisfactory  check  on  the 
summation  is  afforded.  It  is  possible,  however,  by  taking  the 
point  of  reference  near  the  mean,  to  still  further  shorten  the  labor 
of  the  computation. 
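
The first summation method may be set out as a short procedure.
The Python sketch below is illustrative only: the frequencies are
hypothetical and the function name is chosen here. Each new column
is formed by adding the preceding column from the bottom upward,
the check on the summation is applied at every step, and the
moments about x = 0 obtained from the S's are compared with those
found by the direct method.

```python
def first_summation(freqs):
    """Return S2, S3, S4, S5 by successive summation from the bottom upward."""
    n = sum(freqs)
    column = list(freqs)                  # column (1)
    s_values = []
    for _ in range(4):                    # columns (2) to (5)
        previous_total = sum(column)
        running, new_column = 0, []
        for value in reversed(column):    # add from the bottom of the column
            running += value
            new_column.append(running)
        column = new_column[::-1]
        # Check: the sum of a column is the largest number in the next one.
        assert column[0] == previous_total
        s_values.append(sum(column) / n)
    return s_values


freqs = [2, 5, 9, 6, 3]                   # hypothetical, classes x = 1..5
s2, s3, s4, s5 = first_summation(freqs)

m1 = s2
m2 = 2 * s3 - s2
m3 = 6 * s4 - 6 * s3 + s2
m4 = 24 * s5 - 36 * s4 + 14 * s3 - s2

n = sum(freqs)
direct = [
    sum(f * x ** k for x, f in enumerate(freqs, start=1)) / n
    for k in range(1, 5)
]
print([m1, m2, m3, m4])
print(direct)
```

Only additions are required until the final step, which is the
practical advantage claimed for the method.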

The Second Summation Method of Computing the Moments.
To illustrate the second method let us take a distribution
of eight classes and assume the fixed point of reference
at class 5. Then we sum from both top and bottom to, but not
including, the frequencies of class 5 in accordance with the
following scheme.

$$\begin{array}{llll}
(1) & (2) & (3) & (4) \\
y_1 & y_1 & y_1 & y_1 \\
y_2 & y_2 + y_1 & y_2 + 2y_1 & y_2 + 3y_1 \\
y_3 & y_3 + y_2 + y_1 & y_3 + 2y_2 + 3y_1 & y_3 + 3y_2 + 6y_1 \\
y_4 & y_4 + y_3 + y_2 + y_1 & y_4 + 2y_3 + 3y_2 + 4y_1 & \\
y_5 & & & \\
y_6 & y_6 + y_7 + y_8 & y_6 + 2y_7 + 3y_8 & y_6 + 3y_7 + 6y_8 \\
y_7 & y_7 + y_8 & y_7 + 2y_8 & y_7 + 3y_8 \\
y_8 & y_8 & y_8 & y_8
\end{array}$$

$$\begin{array}{lll}
(1) & (5) & (6) \\
y_1 & y_1 & y_1 \\
y_2 & y_2 + 4y_1 & \\
y_3 & & \\
y_4 & & \\
y_5 & & \\
y_6 & y_6 + 4y_7 + 10y_8 & y_6 + 5y_7 + 15y_8 \\
y_7 & y_7 + 4y_8 & y_7 + 5y_8 \\
y_8 & y_8 & y_8
\end{array}$$

Forming $\mu_1'$, about the point x = 5, by the direct method
we have

$$\mu_1' = -\frac{1}{N}(y_4 + 2y_3 + 3y_2 + 4y_1) + \frac{1}{N}(y_6 + 2y_7 + 3y_8).$$

But $\mu_1'$ has been defined as equal to $S_2$. Hence $S_2$ is obtained
by subtracting the last upper summation term from the last
lower term in column (3).


INTRODUCTION    TO    MATHEMATICAL    STATISTICS  IOI 

By  direct  computation, 

i 


43'7 

TV 

But  p!,  =  2Ss  —  5*2.  or  2^3  =,/i'2  +  5"2. 

I 
Hence  2$3  =  —  (y4  +  4y3  +  $y2  +  i6yl  +  y«  +  4y7  +  9^8 

N 

I 


i  i 

Therefore,  S3  =    -  (v3  +  3V2  +  6va)  H  ---  (^  +  3.V?  +  6y8). 

TV  TV 

That  is,  ^3  is  the  sum  of  the.  last  term  in  the  positive,  or  lower, 
summation  and  the  last  but  one  (the  last  term  as  written  in 
the  scheme)  of  the  negative  summation  terms  in  column  (4). 

Likewise, 

i  i 

S4  =  —  (y6  +  4y7  +  I0y8)  --  (3;.,  +  4V!  )  ,    the    difference 

TV  TV 

between  the  last  positive  summation  term  and  the  last  but  two  of 
the  negative  summation  terms  in  column  (5). 

i  i 

And  5\  =  —  (y6  +  4^7  +  J53'«)  H  ---  yi  tne  sum  of  the  last 

TV  TV' 

positive  summation  term  and  the  last  but  three  of  the  negative 
terms  in  column  (6). 

After the S's are obtained the formulas of page 91 are applied
to obtain the $\mu$'s.

As in the first summation method the partial summations
can be added for checks on the arithmetic.

This  second  summation  method  will  be  found  very  con- 
venient, especially  when  the  number  of  classes  is  large  or  the 
frequencies  are  of  considerable  size. 




The  following  computations  for  the  data  of  page  24  illus- 
trate the  two  summation  methods. 

Computations of this length should never be attempted
without first arranging a complete form with a place for each
number and that place so chosen that the number is in its most
convenient location. The entire computation should be planned
before the arithmetic is begun.

Class    Freq.      S2       S3        S4        S5
   1        2      750     5925     28463    105421
   2       10      748     5175     22538     76958
   3       11      738     4427     17363     54420
   4       38      727     3689     12936     37057
   5       57      689     2962      9247     24121
   6       93      632     2273      6285     14874
   7      106      539     1641      4012      8589
   8      126      433     1102      2371      4577
   9      109      307      669      1269      2206
  10       87      198      362       600       937
  11       75      111      164       238       337
  12       23       36       53        74        99
  13        9       13       17        21        25
  14        4        4        4         4         4
Totals    750     5925    28463    105421    329625

S2 = 7.9    S3 = 37.95    S4 = 140.56    S5 = 439.5

(1)  d = S2                    =   7.9
(2)  d(1 + d)                  =  70.31
(3)  d(1 + d)(2 + d)           = 696.069
(4)  3(1 + d)                  =  26.7
(5)  4(1 + d) + 2              =  37.6
(6)  6(1 + d)(2 + d) - 1       = 527.66

$\mu_2$ = 2S3 - (2) = 5.592

$\mu_3$ = 6S4 - $\mu_2$.(4) - (3) = -2.015

$\mu_4$ = 24S5 - $\mu_3$.(5) - $\mu_2$.(6) - (3 + d).(3) = 85.937


Notice  that  the  sum  of  the  first  column  =  750,  the  last 
sum  at  the  top  of  the  S2  column,  and  similarly  for  each  following 
column.  The  computation  form  is  arranged  for  the  use  of  a 
"Millionaire"  calculating  machine. 


The second summation method is started as follows:

Class    Freq.     (2)     (3)     (4)     (5)     (6)
   1        2        2       2       2       2       2
   2       10       12      14      16      18      20
   3       11       23      37      53      71      91
   4       38       61      98     151     222
   5       57      118     216     367
   6       93      211     427
   7      106
   8      126      433    1102    2371    4577    8185
   9      109      307     669    1269    2206    3608
  10       87      198     362     600     937    1402
  11       75      111     164     238     337     465
  12       23       36      53      74      99     128
  13        9       13      17      21      25      29
  14        4        4       4       4       4       4
Totals    750     1102    2371    4577    8185

S2 = (1102 - 427)/750 = 0.9
S3 = (2371 + 367)/750 = 3.65
S4 = (4577 - 222)/750 = 5.81
S5 = (8185 + 91)/750 = 11.03

The  computation  from  this  point  on  is  the  same  as  under  the 
first  method,  except  that  the  origin  is  at  class  7,  or  the  height 
class  67,  instead  of  height  class  60. 
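The second summation method can be sketched in the same way. The Python below is an illustrative sketch, not a prescribed procedure: the function names are hypothetical, and the rule that each successive upper column stops one class earlier is taken from the scheme given above. The upper and lower columns are summed toward the class of reference, and the last terms are combined with alternating signs.

```python
from itertools import accumulate

# Frequencies of the fourteen classes of the worked example.
FREQS = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]


def last(column):
    """Last term of a summation column, or zero if the column is empty."""
    return column[-1] if column else 0


def second_summation_S(freqs, origin):
    """Return (S2, S3, S4, S5) with the point of reference at class `origin` (1-based)."""
    n = sum(freqs)
    # Column (2): sums running toward the origin from the top and from the bottom.
    upper = list(accumulate(freqs[:origin - 1]))
    lower = list(accumulate(reversed(freqs[origin:])))
    sums = []
    sign = -1  # S2 and S4 are differences, S3 and S5 are sums
    for _ in range(4):
        upper = list(accumulate(upper))
        lower = list(accumulate(lower))
        sums.append((last(lower) + sign * last(upper)) / n)
        upper = upper[:-1]  # the next upper column stops one class earlier
        sign = -sign
    return tuple(sums)


if __name__ == "__main__":
    print(second_summation_S(FREQS, 7))
```

With the origin at class 7 the upper terms entering S2, S3, S4 and S5 are 427, 367, 222 and 91, the last entries of the successive upper columns of the table.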

Exercises. 

9.  Compute   the    moments    for   the    frequency  distribution    of    page 
29  by  the  two  summation  methods. 

10.  Demonstrate  the   proof   of   the   two   summation   methods    for   n 
classes. 

11.  What  difference  would  result  in  the  computations  of  the  second 
summation  method  if  the  origin  were  taken  at  the  eighth  class  instead  of 
the  seventh  so  that  the  upper  sum  in  the  first  summation  is  the  larger? 

Correction Formulas for the Moments. All the methods that
have been proposed for finding the moments assume that the frequencies
are concentrated at the center of each class, while actually the deviations
are continuously distributed from one end of the range to the other, so
that there is nothing in the nature of the data to correspond to the
classes, mid-ordinates, etc. A certain degree of error is therefore
introduced by these methods. We are not really working with the actual
deviations but with the artificial classes built up from the actual
deviations. How far, then, are facts which hold for the classes of
significance for the actual variates? It may well be that in ordinary
statistical work the closeness of the measurements does not warrant taking
these errors into account, but the corrections are easily applied and
frequently make a significant difference in the results. However, the
corrections should not be applied to data not accurate enough to warrant
such care, no matter how easily they are applied. The methods
adopted in computation must never be such as to presuppose more
accurate data than that in hand.

When the distinction is made between the moments as calculated
from the class frequencies and deviations and the moments calculated
under the assumption of continuous variation, it is customary to denote
the values as computed by $\nu_1, \nu_2, \nu_3, \nu_4$ and $\nu'_1, \nu'_2, \nu'_3, \nu'_4$, and
reserve the corresponding $\mu$'s for the values under the assumption of
continuity. When no account is taken of the distinction between the
discrete and continuous series of frequencies, the $\mu$'s alone are used.
The $\nu$'s are often spoken of as the raw or unadjusted moments and
the $\mu$'s as the adjusted moments.

The adjustment or correction formulas, with the class interval as the unit, are:

$$\mu_2 = \nu_2 - \tfrac{1}{12}, \qquad \mu_3 = \nu_3, \qquad \mu_4 = \nu_4 - \tfrac{1}{2}\nu_2 + \tfrac{7}{240}.$$


The  theory  of  these  corrections  is  due  to  Dr.  Sheppard 
and  to  Professor  Pearson.  A  simple  demonstration  of  the 
formulas is that of Bio. III, p. 308.

According  to  the  underlying  mathematical  theory  these  cor- 
rection formulas  hold  in  strictness  only  for  a  frequency  curve 
with  high  contact  at  each  end.  When  these  conditions  are  not 
satisfied  it  is  probably  best  not  to  apply  the  corrections. 
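The corrections are simple enough to apply in a few lines. The following Python sketch assumes the standard form of Sheppard's corrections for moments about the mean, with a class width h (h = 1 when the class interval is the unit); the function name and the raw moments in the usage lines are hypothetical and serve only to show the arithmetic.

```python
def sheppard_corrections(nu2, nu3, nu4, h=1.0):
    """Adjust raw moments about the mean for grouping into classes of width h.

    Assumes a frequency curve with high contact at each end, as the text requires.
    """
    mu2 = nu2 - h ** 2 / 12
    mu3 = nu3  # the third moment needs no correction
    mu4 = nu4 - (h ** 2 / 2) * nu2 + 7 * h ** 4 / 240
    return mu2, mu3, mu4


if __name__ == "__main__":
    # Hypothetical raw moments, in class-interval units.
    print(sheppard_corrections(5.0, -2.0, 80.0))
```

Since the correction to the second moment is always one-twelfth of the squared class width, its relative importance falls as the number of classes covering the range grows.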

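Theorem I. Changing the unit of measurement of the
deviation; that is, multiplying each deviation by a constant,
multiplies a moment by that constant raised to a power equal to
the order of the moment. For,

$$\mu_n = \frac{\sum x^n y}{N}$$

and $\sum (rx)^n y = r^n \sum x^n y$.

Theorem II. Multiplying or dividing each frequency by a
constant does not change the moments. For, if each frequency y
is replaced by ay, the total frequency N becomes aN, and

$$\frac{\sum x^n (ay)}{aN} = \frac{\sum x^n y}{N}.$$

Because the values of the third and fourth moments depend
on the unit of measure of the deviations it is usual to employ
these two moments in the forms $\beta_1$ and $\beta_2$, respectively, where
$\beta_1 = \mu_3^2/\mu_2^3$ and $\beta_2 = \mu_4/\mu_2^2$. To show that $\beta_1$ and $\beta_2$ are
independent of the unit of measure of x let us write

$$\beta_1 = \frac{\left(\sum x^3 y / N\right)^2}{\left(\sum x^2 y / N\right)^3} \quad\text{and}\quad \beta_2 = \frac{\sum x^4 y / N}{\left(\sum x^2 y / N\right)^2}.$$

Then let x be changed into rx where r is any constant.
This gives

$$\beta_1 = \frac{\left(r^3 \sum x^3 y / N\right)^2}{\left(r^2 \sum x^2 y / N\right)^3} = \frac{\left(\sum x^3 y / N\right)^2}{\left(\sum x^2 y / N\right)^3},$$

and similarly for $\beta_2$.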
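The two theorems and the invariance of the betas are easily verified numerically. The sketch below uses a small hypothetical distribution (the class values and frequencies are not from the text) and hypothetical function names; it changes the unit of x and multiplies every frequency by a constant, and the betas should be unchanged in both cases.

```python
def central_moment(xs, ys, k):
    """k-th moment about the mean of values xs with frequencies ys."""
    n = sum(ys)
    mean = sum(x * y for x, y in zip(xs, ys)) / n
    return sum((x - mean) ** k * y for x, y in zip(xs, ys)) / n


def betas(xs, ys):
    """Return (beta1, beta2) = (mu3^2 / mu2^3, mu4 / mu2^2)."""
    mu2 = central_moment(xs, ys, 2)
    mu3 = central_moment(xs, ys, 3)
    mu4 = central_moment(xs, ys, 4)
    return mu3 ** 2 / mu2 ** 3, mu4 / mu2 ** 2


if __name__ == "__main__":
    xs = [1, 2, 3, 4, 5]    # hypothetical class values
    ys = [1, 4, 6, 3, 1]    # hypothetical frequencies
    print(betas(xs, ys))
    print(betas([2.54 * x for x in xs], ys))   # change of unit (Theorem I)
    print(betas(xs, [3 * y for y in ys]))      # frequencies multiplied (Theorem II)
```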


Exercises. 

12.  Show   that   adding   a   constant   to    each   deviation   changes   the 
moments. 

13.  Show   that   adding   a   constant   to    each    frequency   changes    the 
moments. 

Summations.  The  following  exercises  are  intended  for 
practice  in  using  summations  and  should  be  carefully  worked 
through  in  order  that  a  comprehension  of  the  somewhat  de- 
tailed discussions  of  subsequent  chapters  is  not  hindered  by  a 
lack  of  familiarity  with  the  necessary  algebra. 

Exercises. 


14. Show that the square of $\sum xy$ is $\sum x^2y^2 + 2\sum x_s y_s x_t y_t$, where
the subscripts are attached in the second summation to indicate the
product of unequal deviations, and all deviations are measured from the mean.

15. By actually computing the separate value of each summation
verify the relation $(\sum xy)^2 = \sum x^2y^2 + 2\sum x_s y_s x_t y_t$ for the distribution
1, 2, 5, 2, 1.

16.  Establish  the  relation    (S-ry)3  —  S^V  +  322  W^r 

17.  Establish     the     relation      (2-rv)4  =  2*V  + 


18.  Show  that  (2-r.v)  (^y)  = 

19. Prove that $x_s^2 + x_t^2 > 2x_s x_t$ and hence $\sum (x_s^2 + x_t^2) > 2\sum x_s x_t$.

20. Show that $(\sum x^4 y)(\sum y) > (\sum x^2 y)^2$.

We have

$$(\sum x^4 y)(\sum y) = x_1^4 y_1^2 + x_1^4 y_1 y_2 + \cdots + x_2^4 y_2 y_1 + x_2^4 y_2^2 + \cdots = \sum x^4 y^2 + \sum y_s y_t (x_s^4 + x_t^4).$$

Also $(\sum x^2 y)^2 = \sum x^4 y^2 + 2\sum x_s^2 y_s x_t^2 y_t$.

Therefore $(\sum x^4 y)(\sum y) > (\sum x^2 y)^2$

if $\sum x^4 y^2 + \sum y_s y_t (x_s^4 + x_t^4) > \sum x^4 y^2 + 2\sum x_s^2 y_s x_t^2 y_t$,

i. e. if $\sum y_s y_t (x_s^4 + x_t^4) > 2\sum y_s y_t x_s^2 x_t^2$.

But the sum of the squares of two quantities is always greater than
twice their product and hence each term on the left is greater than the
corresponding term on the right, thus proving the theorem. The algebraic
discussion may be more easily followed if a summation of only two or
three terms is first employed.

21. Prove that $(\sum x^4 y)(\sum x^2 y) > (\sum x^3 y)^2$.

22. Prove that $\beta_2 > \beta_1$.

The  Moments  and  the  Equation  of  the  Smoothed  Curve. 

It  is  shown  in  Chapter  II  that  a  smooth  curve  is  fitted  on  the 
basis  of  principles  which  are  assumed  true  for  the  data  as  a 
whole.  One  such  principle  is  that  of  equality  of  area  which 
assumes  that  the  area  under  the  curve  is  equal  in  numerical 
value  to  the  total  frequency  of  the  distribution. 

The  principle  of  equality  of  moments  assumes  in  addition 
to  the  equality  of  area  and  total  frequency  that  the  first,  second, 
third  and  fourth  moments  computed  directly  from  the  data  are 
respectively  equal  to  the  first,  second,  third  and  fourth  moments 
computed  from  the  adjusted  frequencies. 

To illustrate the application of the method of equality of
moments let us fit a straight line to the points (2,4), (3,3),
(4,7), (5,6).

The equation of the required line is y = mx + b where m
and b are to be determined. The adjusted y's in terms of m
and b are 2m + b, 3m + b, 4m + b, 5m + b.

The equality of the area and the total frequency can
be expressed as an equality of moments if the moment of
zero order is permitted. This is possible because any number
with an exponent zero is equal to unity. Hence
$2^0 \cdot 4 + 3^0 \cdot 3 + 4^0 \cdot 7 + 5^0 \cdot 6 = 4 + 3 + 7 + 6 = 20$. Also,
$2^0(2m + b) + 3^0(3m + b) + 4^0(4m + b) + 5^0(5m + b) = 14m + 4b$.

Hence, on equating the two zero moments,

14m + 4b = 20.

From the first moment, 2.4 + 3.3 + 4.7 + 5.6 = 2(2m + b)
+ 3(3m + b) + 4(4m + b) + 5(5m + b), we have

54m + 14b = 75.

Solving these two moment equations simultaneously we have

m = 1,  b = 3/2.

Therefore y = x + 3/2 is the required equation of the
straight line fitted to the given points on the basis of the
assumption of equality of the zero and first moments respectively.
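The two moment equations of the illustration can be formed and solved mechanically. The Python sketch below is one way of doing so, with a hypothetical function name and exact fractions to avoid rounding; it equates the zero and first moments of the observed and adjusted ordinates and solves the resulting pair of equations by determinants.

```python
from fractions import Fraction


def fit_line_by_moments(points):
    """Fit y = m*x + b by equating the zero and first moments."""
    n = len(points)
    sx = sum(Fraction(x) for x, _ in points)
    sxx = sum(Fraction(x) ** 2 for x, _ in points)
    sy = sum(Fraction(y) for _, y in points)
    sxy = sum(Fraction(x) * y for x, y in points)
    # Zero moment:   sx  * m + n  * b = sy
    # First moment:  sxx * m + sx * b = sxy
    det = sx * sx - n * sxx
    m = (sy * sx - n * sxy) / det
    b = (sx * sxy - sxx * sy) / det
    return m, b


if __name__ == "__main__":
    m, b = fit_line_by_moments([(2, 4), (3, 3), (4, 7), (5, 6)])
    print(m, b)  # prints "1 3/2"
```

For the four points of the text the sums are 14, 54, 20 and 75, so the equations solved are 14m + 4b = 20 and 54m + 14b = 75.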

Exercises. 

23. Fit a straight line to the preceding illustrative points on the
assumption of equality of the first and second moments respectively.
Should the resulting equation agree exactly with the equation found
above?

24. Fit a straight line to the points (1,5), (3,8), (4,6), (5,5), (7,10).

25. Fit a parabola, $y = a + bx + cx^2$, to the points of
Exercise 24.


CHAPTER  XII. 
FURTHER  THEORY  OF  CORRELATION. 

A  Second  Concept  of  Correlation.  In  Chapter  VII  two 
attributes  are  said  to  be  correlated  when  there  is  a  tendency  for 
a  change  in  the  value  of  one  to  be  followed  by  a  change  in  the 
value  of  the  other.  And  the  ratio  of  the  standard  deviation  of 
the  means  of  the  arrays  to  the  standard  deviation  of  all  the 
variates  was  taken  as  the  measure  of  the  degree  of  correlation 
between  the  attributes.  A  second  approach  to  the  matter  of  cor- 
related variates  is  as  follows. 

On  the  assumption  that  the  mean  is  the  representative  of 
the  variates  of  an  array  the  dependence  of  y  on  x  is  exhibited 
by  the  curve  of  means  ;  that  is,  by  the  regression  curve.  Obvi- 
ously this  curve  is  a  significant  measure  of  the  dependence  of 
y  on  x  only  insofar  as  the  means  are  in  fact  representatives  of 
the  variates  of  the  respective  arrays.  Within  this  limitation 
the  spread  of  the  variates  about  the  means  of  the  successive 
arrays is a measure of the extent of dependence of y on x; that
is,  of  the  correlation  of  y  with  x. 

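Let $\sigma_{ay}^2$ denote the mean squared divergence from the
regression curve. Then

$$N\sigma_{ay}^2 = \sum_x \sum_y n_{xy}(y - \bar{y}_x)^2.$$

As is explained in Chapters VIII and IX, this mean squared
deviation must be divided by $\sigma_y^2$ in order to obtain a correlation
index of value for purposes of comparison.

To reduce $\sigma_{ay}^2$ we write,

$$N\sigma_{ay}^2 = \sum\sum n_{xy}(y - \bar{y} + \bar{y} - \bar{y}_x)^2 = N\sigma_y^2 + N\eta^2\sigma_y^2 + 2\sum\sum n_{xy}(y - \bar{y})(\bar{y} - \bar{y}_x).$$

But

$$2\sum_x\sum_y n_{xy}(y - \bar{y})(\bar{y} - \bar{y}_x) = 2\sum_x (\bar{y} - \bar{y}_x)\sum_y n_{xy}(y - \bar{y}) = -2\sum_x n_x(\bar{y}_x - \bar{y})^2 = -2N\eta^2\sigma_y^2.$$

Therefore $N\sigma_{ay}^2 = N\sigma_y^2 - N\eta^2\sigma_y^2$.

On division by $N\sigma_y^2$, we have,

$$\frac{\sigma_{ay}^2}{\sigma_y^2} = 1 - \eta^2.$$

It accordingly appears that this second measure of correlation
is not independent of the first, being equal to $1 - \eta^2$.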
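The relation between the two measures can be checked on any correlation table. The Python sketch below uses a small hypothetical table (the frequencies are invented for illustration and the names are not from the text); it computes the variance of y, the squared correlation ratio from the means of the arrays, and the mean squared divergence from those means, and the last should equal the first multiplied by one minus the second.

```python
# Hypothetical correlation table: TABLE[x][y] is the frequency of the subgroup (x, y).
TABLE = {
    1: {1: 4, 2: 3, 3: 1},
    2: {1: 2, 2: 5, 3: 3},
    3: {2: 2, 3: 6, 4: 2},
    4: {3: 3, 4: 5},
}


def array_statistics(table):
    """Return (variance of y, eta squared, mean squared divergence from array means)."""
    n = sum(f for row in table.values() for f in row.values())
    mean_y = sum(y * f for row in table.values() for y, f in row.items()) / n
    var_y = sum((y - mean_y) ** 2 * f
                for row in table.values() for y, f in row.items()) / n
    between = 0.0  # sum over arrays of n_x * (array mean - general mean)^2
    within = 0.0   # sum over subgroups of n_xy * (y - array mean)^2
    for row in table.values():
        n_x = sum(row.values())
        array_mean = sum(y * f for y, f in row.items()) / n_x
        between += n_x * (array_mean - mean_y) ** 2
        within += sum((y - array_mean) ** 2 * f for y, f in row.items())
    return var_y, between / (n * var_y), within / n


if __name__ == "__main__":
    var_y, eta_sq, sigma_ay_sq = array_statistics(TABLE)
    print(sigma_ay_sq, var_y * (1 - eta_sq))  # the two values should agree
```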

Exercises. 

1. Show that the mean spread of the variates about the regression
line of Chapter IX is equal to $(1 - r^2)\sigma_y^2$.

2. Show that $\sigma_{ay}^2$ is zero for perfect correlation and equal to $\sigma_y^2$
for zero correlation.

3. Discuss the convenience of the computation formula
$\eta^2 = 1 - \sigma_{ay}^2/\sigma_y^2$.

4.  Which  approach  to  the  numerical  measure  of  correlation  seems 
clearer? 

Derivation  of  the  Equations    of    the    Regression    Lines. 

Let the regression equations be of the form:

$$\bar{y}_x = b_{yx} \cdot x + a$$

and $\bar{x}_y = b_{xy} \cdot y + c$.

The moments of order zero and unity give for the regression
of y on x two moment equations:

$$\sum n_x \bar{y}_x = b_{yx}\sum n_x \cdot x + Na,$$

and

$$\sum n_x \bar{y}_x \cdot x = b_{yx}\sum n_x \cdot x^2 + a\sum n_x \cdot x.$$

As explained on page 75 the moments are computed for each
individual frequency of an array. Hence we have $\sum n_x \cdot \bar{y}_x$,
and not merely $\sum \bar{y}_x$. For the same reason a appears in the
moment sum once for each frequency; that is, N times.

Since $\bar{y}_x = \frac{1}{n_x}\sum_y n_{xy} \cdot y$,

$$\sum_x n_x \bar{y}_x = \sum_x n_x \cdot \frac{1}{n_x}\sum_y n_{xy} \cdot y = \sum_x \sum_y n_{xy}\, y,$$

a first moment about a horizontal line through the mean (both y and
x are assumed to be measured from their respective means) and hence
zero from an obvious extension of the theorem of page 35.

Likewise $\sum n_x \cdot x = 0$. Hence, from the first moment
equation, a is equal to zero.

In the second equation we have,

$$\sum_x n_x \bar{y}_x \cdot x = \sum_x \sum_y n_{xy} \cdot y \cdot x.$$

The summation $\sum n_x \cdot x^2$ has been taken equal to $N\sigma_x^2$ and
$\sum n_y \cdot y^2$ equal to $N\sigma_y^2$. It seems consistent with this notation
to assume $\sum\sum n_{xy}\, y\, x = N r \sigma_x \sigma_y$ where r is the numerical constant
of Chapter IX.

On reducing the second moment equation we have

$$N r \sigma_x \sigma_y = b_{yx} \cdot N\sigma_x^2.$$

Therefore $b_{yx} = r \cdot \frac{\sigma_y}{\sigma_x}$

and hence $\bar{y}_x = r \cdot \frac{\sigma_y}{\sigma_x} \cdot x$ is the required regression equation.


Exercises. 

5. Derive the regression equation $\bar{x}_y = r\frac{\sigma_x}{\sigma_y}\, y$.

6. Prove in detail that $\sum\sum n_{xy}\, y = 0$ where x and y are measured
from the mean.

When x and y are measured from the original axes the
regression equations become

$$\bar{y}_x - \bar{y} = r\frac{\sigma_y}{\sigma_x}(x - \bar{x})$$

$$\bar{x}_y - \bar{x} = r\frac{\sigma_x}{\sigma_y}(y - \bar{y}).$$

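The Relation Between $\eta$ and r. It was shown in Chapter
IX that $\eta$ and r have the same numerical value when the regression
is truly linear. Hence a lack of agreement in the values
of $\eta$ and r is an indication of a divergence from linearity in the
regression. The difference between $\eta$ and r is expressible in terms
of the divergence from linearity by the two equations:

$$N\sigma_y^2(\eta_y^2 - r^2) = \sum n_x (Y_x - \bar{y}_x)^2$$

and $N\sigma_x^2(\eta_x^2 - r^2) = \sum n_y (X_y - \bar{x}_y)^2$, where $Y_x$ and $X_y$ are
the regression line means.

To prove the first of these formulas let us add and subtract
$\bar{y}$ for each term in the summation $\sum n_x(Y_x - \bar{y}_x)^2$. We then have
after expansion,

$$\sum n_x (Y_x - \bar{y}_x)^2 = \sum n_x\{(Y_x - \bar{y})^2 - 2(Y_x - \bar{y})(\bar{y}_x - \bar{y}) + (\bar{y}_x - \bar{y})^2\}.$$

On substituting from the regression equations this expanded
form becomes

$$\sum n_x \cdot r^2\frac{\sigma_y^2}{\sigma_x^2}(x - \bar{x})^2 + \sum n_x(\bar{y}_x - \bar{y})^2 - 2\sum n_x \cdot r\frac{\sigma_y}{\sigma_x}(\bar{y}_x - \bar{y})(x - \bar{x}),$$

which equals

$$N r^2\sigma_y^2 + N\eta_y^2\sigma_y^2 - 2r\frac{\sigma_y}{\sigma_x}\sum n_x(\bar{y}_x - \bar{y})(x - \bar{x}).$$

Substituting for the last term, since $\sum n_x(\bar{y}_x - \bar{y})(x - \bar{x}) = N r\sigma_x\sigma_y$,
and collecting we have

$$\sum n_x (Y_x - \bar{y}_x)^2 = N r^2\sigma_y^2 + N\eta_y^2\sigma_y^2 - 2N r^2\sigma_y^2 = N\sigma_y^2(\eta_y^2 - r^2).$$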


Exercises. 

7. Prove the formula $\sum n_y(X_y - \bar{x}_y)^2 = N\sigma_x^2(\eta_x^2 - r^2)$.

8. Show from these formulas that $\eta > r$.

9. Show that the same pair of equations will be obtained for the
regression lines if the assumed lines are fitted to the individual
frequencies instead of to the means of the arrays.

10. Prove that for truly linear regression $\sigma_{my} = r\sigma_y$.

11. Show that for truly linear regression $\sigma_{ay}^2 = \sigma_y^2(1 - r^2)$.

The Coefficient r for Non-linear Regression. It has been
seen on page 85 that r is always too small for any but strictly
linear regressions. This is due to the fact that the summation
$\sum\sum n_{xy}(x - \bar{x})(y - \bar{y})$ involves both positive and negative terms
which cancel each other with a consequent reduction in the value
of the summation. If the regression curve is carefully drawn
a fair idea of the trustworthiness of r can be obtained by
observing the departures of that curve from linearity. A more
accurate way, of course, is to compute both $\eta$ and r and observe
the difference in value of the two constants.

Since

$$r = \frac{\sum\sum n_{xy}(x - \bar{x})(y - \bar{y})}{N\sigma_x\sigma_y}$$

the size of r varies directly as the value of the summation in the
numerator. In this summation the large values of both x and y
are along either diagonal and hence r will be largest numerically
when the values of $n_{xy}$ are largest along a diagonal. If the
frequencies tend to lie along one diagonal the value of r will be
positive; along the other, negative. If the distribution should
exhibit two tendencies, to concentrate along both diagonals,
the cancellation of terms with opposite signs would give rise to
a small value for r. Also the regression may be markedly
non-linear, circular, or periodic as a sine curve, so that the straight line
fitted to the means of the arrays is practically horizontal, resulting
in a very small value for r. This may be true even for data
which show a definite tendency for the frequencies to cluster
closely along the curve of means; that is, it is possible for r to
have a small value even though the data show the attributes to
have in fact a high degree of correlation. In any of these or
similar cases the correlation should be determined from the
correlation ratio which is not affected by the form of the
regression.
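A small numerical case makes the point concrete. In the Python sketch below the data are hypothetical: the means of the arrays lie on the parabola y = x squared, with two observations in each array one unit above and below the mean. The function name is likewise hypothetical. The coefficient r vanishes because the positive and negative products cancel, while the correlation ratio remains close to unity.

```python
from math import sqrt


def r_and_eta(points):
    """Return (r, eta of y on x) for a list of (x, y) pairs, each counted once."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points) / n
    var_y = sum((y - mean_y) ** 2 for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    r = cov / sqrt(var_x * var_y)
    # Group the y's into arrays by their x value and use the array means.
    arrays = {}
    for x, y in points:
        arrays.setdefault(x, []).append(y)
    between = sum(len(ys) * (sum(ys) / len(ys) - mean_y) ** 2
                  for ys in arrays.values()) / n
    eta = sqrt(between / var_y)
    return r, eta


# Hypothetical data clustering closely about the curve y = x^2.
POINTS = [(x, x * x + e) for x in range(-3, 4) for e in (-1, 1)]

if __name__ == "__main__":
    print(r_and_eta(POINTS))
```

Here the variance of the array means is 12 and the total variance of y is 13, so the correlation ratio is the square root of 12/13, about 0.96, although r is zero.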

The  Most  Probable  Value  of  a  Characteristic  can  be  de- 
termined from  r.  Let  us  first  define  the  properties,  homoscedas- 
ticity  and  homoclisy. 

The mean standard deviation of the frequencies of an
array has been denoted by the symbol $\sigma_{ay}$, where
$\sigma_{ay}^2 = \sigma_y^2(1 - \eta^2)$ or, in terms of r, $\sigma_{ay}^2 = \sigma_y^2(1 - r^2)$. It must be
remembered that these are mean values so that it may well happen
that the true standard deviation of an individual array may
differ considerably from these values. A distribution in which
all arrays of a given sense, that is, all y or all x arrays, have
the same standard deviation is said to be homoscedastic with
respect to the arrays of that sense.



It  has  been  assumed  that  the  frequencies  of  the  arrays 
are  so  distributed  that  the  means  and  the  modes  coincide  ; 
that  is,  so  that  the  mean  is  the  most  probable  value  of  the 
array,  but  this  may  not  always  be  even  approximately  true. 
The  arrays  of  a  distribution  are  said  to  be  homoclitic  when 
the  mean  is  the  most  probable  value  of  the  array. 

On the basis of the just preceding definitions it may be said
that for homoclitic arrays the most probable value of y
corresponding to a given value for x is found from the equation

$$y = r\frac{\sigma_y}{\sigma_x}\, x,$$

where x and y are measured from their means, or, referred to the
original axes,

$$y - \bar{y} = r\frac{\sigma_y}{\sigma_x}(x - \bar{x}).$$

A knowledge of the most probable values is of little
importance unless accompanied by information of the dispersion
about that value; that is, of the standard deviation and the
probable deviation. Since the entire theory of estimating values
of a characteristic is based on the coefficient of correlation the
probable deviation of y when obtained from the regression curve
is logically $0.67449\,\sigma_y\sqrt{1 - r^2}$, and not $0.67449\,\sigma_y\sqrt{1 - \eta^2}$
(provided the arrays are homoscedastic, otherwise no general
formula is possible and the dispersion of each array must be
computed directly from the data of the respective arrays).
Likewise the probable error of x found from the regressions is
$0.67449\,\sigma_x\sqrt{1 - r^2}$, with the same restrictions as to
homoscedasticity.
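The estimate and its probable deviation take only a few lines to compute. The Python sketch below is an illustration only: the function names are hypothetical, the constant is the usual factor connecting the probable deviation of a normal distribution with its standard deviation, and the means, standard deviations and r in the usage lines are invented figures, not those of the tables of this book.

```python
from math import sqrt

# Probable deviation of a normal distribution in units of its standard deviation.
PROBABLE_FACTOR = 0.67449


def most_probable_y(x, mean_x, mean_y, sd_x, sd_y, r):
    """Most probable y for a given x, from the regression line of y on x."""
    return mean_y + r * (sd_y / sd_x) * (x - mean_x)


def probable_deviation_y(sd_y, r):
    """Probable deviation of y about the regression estimate (homoscedastic arrays)."""
    return PROBABLE_FACTOR * sd_y * sqrt(1 - r * r)


if __name__ == "__main__":
    # Hypothetical constants: heights in inches (x), weights in pounds (y).
    estimate = most_probable_y(70, mean_x=67, mean_y=140, sd_x=2.5, sd_y=15, r=0.6)
    print(estimate, probable_deviation_y(15, 0.6))
```

With these invented constants the estimate for x = 70 is 140 + 0.6(15/2.5)(3) = 150.8, and the probable deviation is 0.67449 times 15 times 0.8.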

If the three conditions of linearity of regression, of
homoscedasticity, and of homoclisy are satisfied the just preceding
theory of estimating the value of a variable characteristic is
complete and practically valuable. In ordinary distributions these
conditions are likely to hold at least approximately so that when
intelligently applied the theory is of importance. In every case
the regression curve should be determined graphically and both
$\eta$ and r computed and the difference in their values noted, and
the test for linearity applied. If there is doubt as to the
homoscedasticity, the standard deviations can be computed
directly from the arrays in question and the probable deviations
determined from the resulting values instead of from the
preceding formula. The question of homoclisy is usually
disregarded though wide departures should be noted and taken into
consideration.

Exercises. 

12.*     What  is  the  most  probable  weight  of  a  student  of  height  70 
inches  ? 

13.  What  is  the  most  probable  height  of   a  student  of  weight  132 
pounds? 

14.  What  is  the  most  probable   rainfall  of  a  month  with  a  mean 
temperature  of  54  degrees? 

15.  What  is  the  most  probable  top  beef   cattle  price   for  a  month 
with  a  top  hog  price  of  $8.25? 

16. Compute the probable deviations from the most probable values
of Exercises 12, 13, 14, and 15.

17.  Discuss  the  practical  reliability  of  the  preceding  estimates.     In 
how  far  is  the  probable  deviation  a  trustworthy  index  of  this  reliability? 

If two distributions are superimposed the value of r for the
combined group is connected with the constituent r's by the following
relation.

$$N r\sigma_x\sigma_y = N_1 r_1\sigma_{x_1}\sigma_{y_1} + N_2 r_2\sigma_{x_2}\sigma_{y_2} + \frac{N_1 N_2}{N}(\bar{x}_1 - \bar{x}_2)(\bar{y}_1 - \bar{y}_2).$$

The proof of this equation is left as an exercise.
Two superimposed distributions with the same means have the
simple relation of connection with the r's:

$$N r\sigma_x\sigma_y = N_1 r_1\sigma_{x_1}\sigma_{y_1} + N_2 r_2\sigma_{x_2}\sigma_{y_2},$$

from which the effect of various mixtures of data can be readily traced.
For instance, if the second distribution has a constant frequency for each
subgroup, $r_2$ is zero and the value of r is smaller than that of $r_1$ in the
proportion $\frac{N_1\sigma_{y_1}\sigma_{x_1}}{N\sigma_y\sigma_x}$. That is, adding a constant to each frequency
decreases the value of r.

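The effect of multiplying each frequency is readily determined.
Let $n_x$ be replaced by $an_x$, $n_y$ by $an_y$, and $n_{xy}$ by $an_{xy}$. Then

$$r = \frac{\sum\sum n_{xy} \cdot x \cdot y / N}{\sqrt{\dfrac{\sum n_x x^2}{N} \cdot \dfrac{\sum n_y y^2}{N}}}$$

becomes

$$\frac{a\sum\sum n_{xy} \cdot x \cdot y / aN}{\sqrt{\dfrac{a\sum n_x x^2}{aN} \cdot \dfrac{a\sum n_y y^2}{aN}}} = \frac{\sum\sum n_{xy} \cdot x \cdot y / N}{\sqrt{\dfrac{\sum n_x x^2}{N} \cdot \dfrac{\sum n_y y^2}{N}}}$$

as before.

That is, multiplying each frequency is without effect on the value of r.

* Exercises 12, 13, 14, 15 refer to data already given.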

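Correlation of indices. A mathematical measure of spurious
correlation will now be derived.

Let us take the two series of measurements x and y and let
$I_{xy} = x/y$ be the corresponding series of indices.

The mean $\bar{I}_{xy}$ cannot, except under certain conditions, be obtained by
dividing the mean of the x's by the mean of the y's. For, by definition

$$\bar{I}_{xy} = \frac{1}{N}\sum\frac{x}{y}.$$

Transferring to the respective means, $\bar{x}$ and $\bar{y}$, as origins we have

$$\bar{I}_{xy} = \frac{1}{N}\sum\frac{\bar{x} + \delta x}{\bar{y} + \delta y},$$

where the $\delta$'s denote merely the new variables
measured from the respective means.
On rearranging

$$\bar{I}_{xy} = \frac{1}{N}\frac{\bar{x}}{\bar{y}}\sum\left(1 + \frac{\delta x}{\bar{x}}\right)\left(1 + \frac{\delta y}{\bar{y}}\right)^{-1} = \frac{1}{N}\frac{\bar{x}}{\bar{y}}\sum\left(1 + \frac{\delta x}{\bar{x}} - \frac{\delta y}{\bar{y}} + \frac{\delta y^2}{\bar{y}^2} - \frac{\delta x\,\delta y}{\bar{x}\bar{y}}\right)$$

as far as second terms.
But $\sum\delta x = \sum\delta y = 0$, $\sum\delta y^2 = N\sigma_y^2$ and $\sum\delta x\,\delta y = N r_{xy}\sigma_x\sigma_y$.

Hence,

$$\bar{I}_{xy} = \frac{\bar{x}}{\bar{y}}\left(1 + \frac{\sigma_y^2}{\bar{y}^2} - r_{xy}\frac{\sigma_x\sigma_y}{\bar{x}\bar{y}}\right).$$

This formula shows that the error in assuming $\bar{I}_{xy} = \bar{x}/\bar{y}$ may be
considerable. It is smallest where the data is perfectly correlated and
largest for uncorrelated material.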
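The quality of this second-order approximation can be judged on a small example. The Python sketch below uses two short hypothetical series (invented for illustration, with small coefficients of variation, which the approximation presupposes) and hypothetical function names; it compares the exact mean of the indices with the crude ratio of the means and with the corrected value.

```python
def mean_index_exact(xs, ys):
    """Mean of the indices x/y, computed directly."""
    return sum(x / y for x, y in zip(xs, ys)) / len(xs)


def mean_index_approx(xs, ys):
    """Ratio of the means corrected as far as second-order terms."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    # cov equals r * sigma_x * sigma_y in the notation of the text.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    return (mean_x / mean_y) * (1 + var_y / mean_y ** 2 - cov / (mean_x * mean_y))


if __name__ == "__main__":
    xs = [48, 52, 50, 55, 45]   # hypothetical measurements
    ys = [20, 22, 19, 21, 18]   # hypothetical bases
    print("exact mean of indices:", mean_index_exact(xs, ys))
    print("ratio of the means:   ", sum(xs) / sum(ys))
    print("corrected ratio:      ", mean_index_approx(xs, ys))
```

For these figures the ratio of the means is 2.5 and the corrected value is 2.5(1 + 0.005 - 0.0038) = 2.503, which lies much nearer the exact mean of the five indices than the uncorrected ratio does.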

Exercises. 

18. Let indices be formed from the two series of measurements

1 2 3 4 5 6 7
1 3 4 3 2 7 4

Find the value of $\bar{I}_{xy}$ and compare with $\bar{x}/\bar{y}$.

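The Standard Deviation $\sigma_I$ of the index $I_{xy}$ is given by the
equation

$$\sigma_I^2 = \frac{\bar{x}^2}{\bar{y}^2}\left\{\frac{\sigma_x^2}{\bar{x}^2} + \frac{\sigma_y^2}{\bar{y}^2} - 2r_{xy}\frac{\sigma_x\sigma_y}{\bar{x}\bar{y}}\right\}.$$

The standard deviation will be computed about the corrected mean.
We have

$$N\sigma_I^2 = \sum\left\{\frac{\bar{x} + \delta x}{\bar{y} + \delta y} - \bar{I}_{xy}\right\}^2.$$

On dropping the cross term,

$$N\sigma_I^2 = \frac{\bar{x}^2}{\bar{y}^2}\sum\left\{\left(1 + \frac{\delta x}{\bar{x}}\right)\left(1 + \frac{\delta y}{\bar{y}}\right)^{-1} - \left(1 - r_{xy}\frac{\sigma_x\sigma_y}{\bar{x}\bar{y}} + \frac{\sigma_y^2}{\bar{y}^2}\right)\right\}^2$$

$$= \frac{\bar{x}^2}{\bar{y}^2}\sum\left\{\frac{\delta x}{\bar{x}} - \frac{\delta y}{\bar{y}} + \text{squared terms}\right\}^2$$

$$= N\frac{\bar{x}^2}{\bar{y}^2}\left\{\frac{\sigma_x^2}{\bar{x}^2} + \frac{\sigma_y^2}{\bar{y}^2} - 2r_{xy}\frac{\sigma_x\sigma_y}{\bar{x}\bar{y}}\right\}.$$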

Exercises. 

19. Prove that exactly the same formula is obtained for the standard
deviation about the unadjusted mean $\bar{x}/\bar{y}$.

The theorem just stated as an exercise shows that in so far as the
standard deviation of a distribution of index numbers is concerned, it is
immaterial whether the index of the means, $\bar{x}/\bar{y}$, is corrected or not.



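By a method very similar to that employed in the case of the
standard deviation, it may be shown that the coefficient of correlation
between the indices x/y and z/w is given by the formula

$$r_{II} = \frac{r_{xz}\dfrac{\sigma_x\sigma_z}{\bar{x}\bar{z}} - r_{xw}\dfrac{\sigma_x\sigma_w}{\bar{x}\bar{w}} - r_{yz}\dfrac{\sigma_y\sigma_z}{\bar{y}\bar{z}} + r_{yw}\dfrac{\sigma_y\sigma_w}{\bar{y}\bar{w}}}{\sqrt{\left(\dfrac{\sigma_x^2}{\bar{x}^2} + \dfrac{\sigma_y^2}{\bar{y}^2} - 2r_{xy}\dfrac{\sigma_x\sigma_y}{\bar{x}\bar{y}}\right)\left(\dfrac{\sigma_z^2}{\bar{z}^2} + \dfrac{\sigma_w^2}{\bar{w}^2} - 2r_{zw}\dfrac{\sigma_z\sigma_w}{\bar{z}\bar{w}}\right)}}.$$

When the four variables x, y, z and w are uncorrelated the value of
$r_{II}$ is zero. When the bases are constants $\sigma_y = \sigma_w = 0$, and we have

$$r_{II} = r_{xz}.$$

That is, dividing each value of a characteristic by the same constant
does not affect the value of the coefficient of correlation.

As a special case of this result, if the absolute values of two
characteristics are correlated the degree of correlation is not changed by
expressing the measurements as percents.

When the bases y and w are identical $r_{II}$ takes the value,

$$r_{II} = \frac{r_{xz}\dfrac{\sigma_x\sigma_z}{\bar{x}\bar{z}} - r_{xy}\dfrac{\sigma_x\sigma_y}{\bar{x}\bar{y}} - r_{yz}\dfrac{\sigma_y\sigma_z}{\bar{y}\bar{z}} + \dfrac{\sigma_y^2}{\bar{y}^2}}{\sqrt{\left(\dfrac{\sigma_x^2}{\bar{x}^2} + \dfrac{\sigma_y^2}{\bar{y}^2} - 2r_{xy}\dfrac{\sigma_x\sigma_y}{\bar{x}\bar{y}}\right)\left(\dfrac{\sigma_z^2}{\bar{z}^2} + \dfrac{\sigma_y^2}{\bar{y}^2} - 2r_{yz}\dfrac{\sigma_y\sigma_z}{\bar{y}\bar{z}}\right)}}.$$

Now let x, y and z be entirely uncorrelated. Then

$$r_{II} = \frac{\dfrac{\sigma_y^2}{\bar{y}^2}}{\sqrt{\left(\dfrac{\sigma_x^2}{\bar{x}^2} + \dfrac{\sigma_y^2}{\bar{y}^2}\right)\left(\dfrac{\sigma_z^2}{\bar{z}^2} + \dfrac{\sigma_y^2}{\bar{y}^2}\right)}}.$$

This last value of $r_{II}$ may even equal 0.5 in the case when
$\sigma_x/\bar{x} = \sigma_y/\bar{y} = \sigma_z/\bar{z}$.

Hence by dividing each deviation by a third variable, it is
possible to introduce correlation into strictly uncorrelated
material to as great an extent as 0.5.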

Care must therefore be taken in dealing with index numbers
that the full value of r is significant for the absolute
values of the measurements. By computing r from the formula

$$r = \frac{\dfrac{\sigma_y^2}{\bar{y}^2}}{\sqrt{\left(\dfrac{\sigma_x^2}{\bar{x}^2} + \dfrac{\sigma_y^2}{\bar{y}^2}\right)\left(\dfrac{\sigma_z^2}{\bar{z}^2} + \dfrac{\sigma_y^2}{\bar{y}^2}\right)}}$$

the value of the greatest possible degree of spurious correlation is
obtained. A value of r greater than this value is certainly
significant; a value less may be significant but must be accepted
with caution.

Since by the formula the spurious correlation is zero when
the standard deviation or variability of y is zero, it follows that
the base of a system of index numbers should be as nearly
constant as possible.

A  theory  of  spurious  correlation  might  be  developed  for  the 
correlation ratio but the algebraic details are so much more
workable  for  the  correlation  coefficient  that  it  would  be  hardly 
worth  the  extra  effort.  It  is  conceivable  that  such  a  theory 
would  be  practically  necessary  but  it  is  unlikely  because  after 
all  only  approximate  results  are  valuable.  There  would  be  little 
of  value  in  attempting  to  measure  the  degree  of  spurious  cor- 
relation with  precision. 

It must be remembered that the matter of spurious correlation is
essentially one of interpretation. The question is what does correlation
mean. The correlation is actual and real for the indices but it may be
spurious in so far as the absolute values of the measurements are
concerned.


CHAPTER  XIII. 
THE   METHOD    OF   CONTINGENCY. 

THE CORRELATION OF NON-QUANTITATIVE
CHARACTERISTICS.

The Mean Squared Contingency. Both the correlation
ratio and the correlation coefficient are based on the variation
in the means of the arrays; the first directly and the second
through the straight line fitted to the points located by the
means. Another method of measuring the degree of connection
between the attributes, described and illustrated in this
chapter, is based on elementary notions of probability. It may be
stated in beginning the discussion of the method of contingency
that not the least important value of a study of the subject is the
additional insight which it gives into the fundamental nature of
correlation.

In  Table  I  the  distribution  of  height  without  reference  to 
weight  is  given  by  the  total  frequency  of  the  y  arrays;  that  is, 
by  the  totals  at  the  foot  of  the  table,  and  the  distribution  of 
weight  without  reference  to  the  distribution  of  height  by  the 
column  of  sums  at  the  right  in  the  table. 

When there is no tendency for certain weights to be most
often associated with certain heights, the frequency of a
subgroup should be proportional to the total frequencies of its
two arrays. Thus imagine the frequencies of the subgroups
erased from Table VIII of Chapter VII and then filled in
entirely at random; that is, without bias or selection. Since
110/750 of the total frequency of the distribution appears in
the height array of weight type 137; that is, since 110
individuals out of 750 are of weight 137, it is logical to assume
that this height array contains 110/750 of the frequency of each
array which it crosses. The frequency of the subgroup (68, 137),
for instance, should be 110/750 of 126. And in general,
when the individuals are placed at random, the frequency of a
subgroup is given by the formula $\frac{n_x n_y}{N}$. For the y array of
type x contains $\frac{n_x}{N}$ of the frequencies of each array which it
crosses. The frequency of the x array of type y is $n_y$. Hence
the subgroup of intersection has a frequency, $\frac{n_x}{N} \cdot n_y$, which
equals $\frac{n_x n_y}{N}$.
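The rule just stated can be sketched in a few lines of Python. This is only an illustration: the function name `expected_frequencies` and the small 2 x 2 table are assumptions made for the example and are not taken from the book; the one figure from the text is the subgroup (68, 137), whose random-selection frequency is 110/750 of 126.

```python
def expected_frequencies(table):
    """Random-selection frequency n_x * n_y / N for every subgroup of a table."""
    row_totals = [sum(row) for row in table]        # n_x for each row array
    col_totals = [sum(col) for col in zip(*table)]  # n_y for each column array
    total = sum(row_totals)                         # N
    return [[r * c / total for c in col_totals] for r in row_totals]


# The single subgroup worked in the text: 110/750 of 126.
subgroup_68_137 = 110 * 126 / 750

# A hypothetical 2 x 2 table (not from the book), used only to show the rule.
example = [[10, 20],
           [30, 40]]
expected = expected_frequencies(example)
```

Each entry of `expected` is the product of the two array totals divided by N, so the marginal totals of the random-selection table agree with those of the observed table.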


Now if in the actual distribution the frequency of a
subgroup, $n_{xy}$, is larger or smaller than the random selection
frequency given by the formula $\frac{n_x n_y}{N}$, the divergence must be due
to the presence in the data of a tendency for certain values of
the attributes to be most often associated and hence the total
extent of this divergence is a measure of the degree of the
association or correlation in the data. This method of measuring
correlation is called the method of contingency.


The difference $n_{xy} - \frac{n_x n_y}{N}$ is squared to prevent the
cancelling of positive and negative values. Since only the relative
size of the differences is significant, this square is divided by the
above random selection frequency $\frac{n_x n_y}{N}$. On summing all such
values, we have the mean square contingency $\phi^2$, where

$$N\phi^2 = \sum\sum \frac{\left(n_{xy} - \frac{n_x n_y}{N}\right)^2}{\frac{n_x n_y}{N}}.$$


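On expanding and reducing, this summation is arranged
in a more convenient form for computation. We have

$$\frac{\left(n_{xy} - \frac{n_x n_y}{N}\right)^2}{\frac{n_x n_y}{N}} = \frac{N n_{xy}^2}{n_x n_y} - 2n_{xy} + \frac{n_x n_y}{N},$$

and hence

$$\sum\sum \frac{\left(n_{xy} - \frac{n_x n_y}{N}\right)^2}{\frac{n_x n_y}{N}} = N \sum\sum \frac{n_{xy}^2}{n_x n_y} - 2N + N,$$

since

$$\sum\sum \frac{n_x n_y}{N} = \frac{1}{N} \sum n_x \sum n_y = \frac{1}{N} \sum n_x \cdot N = \sum n_x = N.$$

Therefore $\phi^2 = \sum\sum \frac{n_{xy}^2}{n_x n_y} - 1$.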
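As a check on the reduction, the two forms of the mean square contingency may be computed side by side. The sketch below is illustrative only; the function names and the 2 x 2 table are assumptions, and it supposes that no array has a total of zero.

```python
def margins(table):
    """Return the row totals n_x, the column totals n_y and the grand total N."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    return rows, cols, sum(rows)


def phi_squared_by_definition(table):
    """Sum of (n_xy - n_x n_y / N)^2 / (n_x n_y / N), divided by N."""
    rows, cols, n = margins(table)
    total = 0.0
    for i, row in enumerate(table):
        for j, n_xy in enumerate(row):
            e = rows[i] * cols[j] / n      # random-selection frequency
            total += (n_xy - e) ** 2 / e
    return total / n


def phi_squared_shortcut(table):
    """The computation form: sum of n_xy^2 / (n_x n_y), less 1."""
    rows, cols, _ = margins(table)
    s = sum(n_xy * n_xy / (rows[i] * cols[j])
            for i, row in enumerate(table)
            for j, n_xy in enumerate(row))
    return s - 1.0


# Hypothetical table, not from the book.
example = [[10, 20],
           [30, 40]]
by_definition = phi_squared_by_definition(example)
by_shortcut = phi_squared_shortcut(example)
```

The two values agree to within rounding, which is all the expansion asserts; the shortcut form avoids forming the differences subgroup by subgroup.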


The probable deviation of $\phi$ is discussed at length by
Pearson and Blakeman*.

Exercises. 

1. Compute the value of $\phi$ for the data of Table VIII.

2.  Why  not  divide  the  square  of  the  difference  for  each  sub-group 
by  the  actual  frequency  of  the  sub-group  instead  of  the  frequency  under 
the  assumption  of  no  correlation? 

Properties of $\phi$. In data selected entirely at random; that
is, where $n_{xy} = \frac{n_x n_y}{N}$ for all values of x and y, the value of $\phi$
is of course zero. It does not necessarily follow, however, that
for absolutely uncorrelated material; that is, for data having
$\eta_x = \eta_y = 0$, the value of $\phi$ must be zero.

A moment's consideration will show that the greatest value for
$\sum \frac{n_{xy}^2}{n_x n_y}$, taken over the subgroups of any one array, is unity and
that this greatest value cannot be attained unless the subgroup of
intersection is the only subgroup with non-vanishing frequency in
either of the two arrays intersecting in that subgroup. It follows
that, if the distribution is not square, the number of arrays giving
the maximum value cannot be greater than the number of the
longer arrays. Hence in symbols, if r and s are the numbers of
arrays of the respective attributes and $r = s$ or $r < s$, the greatest
value for $\phi^2$ is $r - 1$.


*Biometrika.    Vol.  V,  p.  191  et  seq. 



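For illustration, in the table

a  b  c  d
e  f  g  h
i  j  k  l

at least one horizontal array must have more than one
non-vanishing subgroup frequency. Let this table be

a  0  0  0
0  f  0  0
0  0  k  l.

Then

$$\sum\sum \frac{n_{xy}^2}{n_x n_y} = \frac{a^2}{a \cdot a} + \frac{f^2}{f \cdot f} + \frac{k^2}{k(k+l)} + \frac{l^2}{l(k+l)} = 1 + 1 + \frac{k}{k+l} + \frac{l}{k+l} = 3,$$

so that $\phi^2 = 3 - 1 = 2$, which is $r - 1$ for $r = 3$.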
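The illustration can be verified numerically. In the sketch below the values given to a, f, k and l are arbitrary choices made for the example, and `phi_squared` is a hypothetical helper implementing the computation form of $\phi^2$; empty subgroups are skipped.

```python
def phi_squared(table):
    """Sum of n_xy^2 / (n_x n_y) over the non-empty subgroups, less 1."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    s = 0.0
    for i, row in enumerate(table):
        for j, n_xy in enumerate(row):
            if n_xy:                      # an empty subgroup contributes nothing
                s += n_xy * n_xy / (rows[i] * cols[j])
    return s - 1.0


# Arbitrary illustrative frequencies for the 3 by 4 table of the text.
a, f, k, l = 5, 7, 2, 3
illustration = [[a, 0, 0, 0],
                [0, f, 0, 0],
                [0, 0, k, l]]
phi2_limit = phi_squared(illustration)

# Moving one individual off the pattern lowers the value below r - 1 = 2.
disturbed = [[a, 1, 0, 0],
             [0, f, 0, 0],
             [0, 0, k, l]]
phi2_disturbed = phi_squared(disturbed)
```

Whatever positive values are assigned to a, f, k and l, the first table gives 2; the second, in which one horizontal array crosses an occupied vertical array, falls short of it.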


Exercises. 

3. Show that for $\phi = 0$, the means of the x and the y arrays lie
on vertical and horizontal lines respectively.

4. Show by actual substitution in the formula
$\phi^2 = \sum\sum \frac{n_{xy}^2}{n_x n_y} - 1$ that $\phi = 0$ when $n_{xy} = \frac{n_x n_y}{N}$.


5. Verify the just preceding theory by assigning different combinations
of values to the symbols a, b, c, d, e, f, g, h, i, in the distribution:

a  b  c 
d  e  f 
g  h  i 

6.  Do  the  same  for  the  distribution 

a    b     c    d    e 

f    g    h    i     j  . 

7. Show that the greatest value of $\phi^2$ for a table of the form of
Table VIII is 13.

8. Give an algebraic demonstration for this theory when applied to
a general distribution r by s fold.


The greatest disadvantage of $\phi$ as a measure of correlation
arises from the fact that its value depends on the number
of arrays in the distribution so that it is almost entirely useless
for purposes of comparison. Another disadvantage lies in the
fact that, notwithstanding the logical simplicity and directness
of the theory underlying the method of contingency, in practice
the interpretation of variations in the value of $\phi$ is a matter
of much difficulty. For instance, when $\phi$ equals 2.5 what is
the significance of an increase of 0.5 in its value? How much
greater is the degree of closeness of association in the latter
case than in the first? A third objection is that for a large
table the labor of computation is heavy.

The first objection above is partially overcome by making use
of the coefficient of contingency, $\sqrt{\frac{\phi^2}{1 + \phi^2}}$. This constant is
given added prestige by the following relation. It may be shown
that for a finely divided distribution of a particular type the
coefficient of contingency and the coefficient of correlation are equal
in value. Consequently in certain forms of distribution this
furnishes a convenient method of obtaining the value of r.
However, care must be taken to make sure that the assumptions
essential for the validity of this theorem are approximated to
with sufficient closeness. Ordinarily it is better to make use of
methods which do not rest on so extensive assumptions.
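A brief sketch may show why the objection is only partially overcome. It assumes that the coefficient of contingency is Pearson's $\sqrt{\phi^2/(1+\phi^2)}$; the helper names and the two tables are hypothetical. Both tables show the closest possible association, yet the coefficient differs with the number of arrays.

```python
import math


def phi_squared(table):
    """Sum of n_xy^2 / (n_x n_y) over the non-empty subgroups, less 1."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    s = 0.0
    for i, row in enumerate(table):
        for j, n_xy in enumerate(row):
            if n_xy:
                s += n_xy * n_xy / (rows[i] * cols[j])
    return s - 1.0


def coefficient_of_contingency(table):
    """Assumed form: square root of phi^2 / (1 + phi^2)."""
    phi2 = phi_squared(table)
    return math.sqrt(phi2 / (1.0 + phi2))


# Hypothetical tables with one non-vanishing subgroup in every array.
perfect_2x2 = [[4, 0],
               [0, 5]]
perfect_3x3 = [[4, 0, 0],
               [0, 5, 0],
               [0, 0, 6]]
c_2x2 = coefficient_of_contingency(perfect_2x2)
c_3x3 = coefficient_of_contingency(perfect_3x3)
```

The coefficient is always less than unity, but its greatest value still rises with the number of classes, so comparisons between tables of different form remain delicate.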

An approximation to the probable deviation of the
coefficient of contingency is to take one and one-third the
probable deviation of r.

Exercises. 

9. Compute the coefficient of contingency for the data of Table VIII
and compare with the value of r already computed.

10. Do the same for the data of Table IX, Chapter VII.

11. By combining arrays in the distribution of Table VII and
computing the successive values of $\sqrt{\frac{\phi^2}{1 + \phi^2}}$ show the effect of
different widths of classes on the value of this constant.

12.  Show   that   the    coefficient   of    contingency   is    smaller   than    the 
value  of  r  computed  by  the  method  of  moments. 

13.  Compare   the    reliability    of   the    coefficient    of    contingency    for 
highly  and  for  slightly  correlated  data. 


14. Compare the labor required to compute the value of $\phi$ with that
for $\eta$.

In concluding this part of the discussion of the method of
contingency it may be stated that when the attributes can be
definitely measured there is no practical advantage in computing
the value of $\phi$.

Non-Quantitative Characteristics. Because the formula
$\sum\sum \frac{n_{xy}^2}{n_x n_y}$ does not contain the deviations x and y and contains only
the frequencies of the subgroups it can be applied to
distributions in which it is impossible or undesirable to assign numerical
values to the deviations; for instance, a distribution of hair and
eye color, of degrees of intelligence in drawing and music. Such
distributions are said to involve characteristics not quantitatively
measured or measurable.

Thus in its fundamental theory the coefficient of contingency
applies with equal validity to quantitative and to non-quantitative
data. Moreover, since the number of classes in the case
of non-quantitative distributions is ordinarily small the labor
of computation is not unduly heavy, and hence the coefficient
of contingency is of greater practical importance for this kind
of data than for quantitative data. However it will now be
shown that for non-quantitative distributions, the correlation
ratio is a more convenient and satisfactory measure of
correlation than is the coefficient of contingency.

A correlation problem very similar to that arising from
non-quantitative data is the finding of the degree of
correlation when the measurements of the attributes in quantitative
data are classified into very broad classes; to find, for instance,
the extent of the tendency for under-height and over-weight to
be associated. Further than the effect that so broad classes may
have in producing errors in the results obtained by the formula
for $\eta$ there is no theoretical objection to the direct application
of the theory of the correlation ratio to a distribution obtained
by grouping into broad classes.

However, the theory of the correlation ratio does not
apply directly to strictly qualitative data and for that reason
we shall justify its use for such distributions by showing that
in a very important form of distribution, the two by two table, $\eta$
and $\phi$ are identical and that in other ordinarily occurring cases
the values of the two constants are highly correlated.

Exercises.

16. Arrange the data of Table VIII in the following form and
compute the values of $\eta$ and $\phi$.

                          Height.
Weight.        Under 68     Over 68     Totals
Under 137
Over 137
Totals

The Four-fold and the Nine-fold tables. We shall now
derive the formulas for $\eta$ and $\phi$ for a 2 x 2 table and obtain
the computation formulas for $\eta$ for a 3 by 3 table. The same
method might be employed to derive special formulas for each
type of table. In the absence of special formulas the general
formula for $\eta$ can be applied directly.

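Let us take the four-fold table,

$$\begin{array}{cc|c} n_{11} & n_{12} & n_1 \\ n_{21} & n_{22} & n_2 \\ \hline n_{:1} & n_{:2} & N \end{array}$$

We have $\bar{y} = \frac{n_{:1} + 2n_{:2}}{N} = 1 + \frac{n_{:2}}{N}$.

Similarly, $\bar{y}_1 = 1 + \frac{n_{12}}{n_1}$ and $\bar{y}_2 = 1 + \frac{n_{22}}{n_2}$.

Substituting these values in the formula,

$$N\sigma_{m_y}^2 = n_1(\bar{y}_1 - \bar{y})^2 + n_2(\bar{y}_2 - \bar{y})^2,$$

where $\sigma_{m_y}^2$ is the mean squared deviation of the means of the
arrays, we have after some detailed reduction,

$$N\sigma_{m_y}^2 = \frac{(n_{11}n_{22} - n_{12}n_{21})^2}{N n_1 n_2}.$$

Also $N\sigma_y^2 = \frac{n_{:1} n_{:2}}{N}$.

Therefore $\eta = \frac{n_{11}n_{22} - n_{12}n_{21}}{\sqrt{n_1 n_2 n_{:1} n_{:2}}}$.

From the formula $\phi^2 = \sum\sum \frac{n_{xy}^2}{n_x n_y} - 1$ it is readily shown
by direct computation that $\phi = \frac{n_{11}n_{22} - n_{12}n_{21}}{\sqrt{n_1 n_2 n_{:1} n_{:2}}}$.

The equality of $\phi$ and $\eta$ for the fourfold distribution is
therefore demonstrated.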
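The equality may also be seen numerically. The sketch below is an illustration under stated assumptions: the table is hypothetical, the two classes of the second character are scored 1 and 2, and the helper names are invented for the example. The correlation ratio is computed from the means of the row arrays and $\phi$ from the computation formula.

```python
import math


def phi_squared(table):
    """Sum of n_xy^2 / (n_x n_y), less 1 (no array total may be zero)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    s = sum(n * n / (rows[i] * cols[j])
            for i, row in enumerate(table) for j, n in enumerate(row))
    return s - 1.0


def eta_squared(table, scores):
    """Squared correlation ratio of the scored column character on the row arrays."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    n = sum(row_totals)
    grand_mean = sum(c * s for c, s in zip(col_totals, scores)) / n
    between = 0.0
    for row, n_x in zip(table, row_totals):
        array_mean = sum(f * s for f, s in zip(row, scores)) / n_x
        between += n_x * (array_mean - grand_mean) ** 2
    total = sum(c * (s - grand_mean) ** 2 for c, s in zip(col_totals, scores))
    return between / total


# Hypothetical four-fold table.
four_fold = [[10, 20],
             [30, 40]]
eta = math.sqrt(eta_squared(four_fold, [1, 2]))
phi = math.sqrt(phi_squared(four_fold))
closed_form = abs(10 * 40 - 20 * 30) / math.sqrt(30 * 70 * 40 * 60)
```

The three quantities coincide. Any other pair of scores for the two classes would give the same value of the correlation ratio, since it is unchanged by a change of origin or unit.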

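For the nine-fold table, we have by a reduction similar to that
for the 2 by 2 table,

$$\bar{y} = \frac{n_{:2} + 2n_{:3}}{N} = l;$$

then, on substituting for $\bar{y}_x$ and $\bar{y}$,

$$\sum n_x(\bar{y}_x - \bar{y})^2 = \sum n_x\left[\frac{n_{x2} + 2n_{x3}}{n_x} - l\right]^2 = \sum \frac{(n_{x2} + 2n_{x3})^2}{n_x} - 2l\sum (n_{x2} + 2n_{x3}) + Nl^2.$$

Therefore,

$$\sum n_x(\bar{y}_x - \bar{y})^2 = \sum \frac{(n_{x2} + 2n_{x3})^2}{n_x} - 2Nl^2 + Nl^2 = \sum \frac{(n_{x2} + 2n_{x3})^2}{n_x} - Nl^2.$$

Similarly $\sum n_y(y - \bar{y})^2 = N(l - l^2) + 2n_{:3}$.

Hence

$$\eta^2 = \frac{\sum \frac{(n_{x2} + 2n_{x3})^2}{n_x} - Nl^2}{N(l - l^2) + 2n_{:3}}.$$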
Exercises. 

17. Compute the value of $\eta$ from the following distribution of the
variation in receipts and prices from month to month of live hogs at
Union Stock Yards, Chicago, from 1901 to 1914.

                          Receipts.
Prices.        Under -50    -50 to 50    Over 50
Under -25          10            7          24
-25 to 25          20           32          26
Over 25            37            5           6

The original formula for $\phi$,

$$\phi^2 = \sum\sum \frac{n_{xy}^2}{n_x n_y} - 1,$$

is probably in the case of the 3 x 3 table as convenient as any
for the computation of that constant.

It is immediately evident that in general $\eta$ and $\phi$ cannot
be equivalent for a table larger than four-fold, because there
are two $\eta$'s for each distribution and only one $\phi$. The
following theorems may be stated.

1. When $\phi = 0$, the value of $\eta$ for y on x and for x on
y are both zero; that is, $\eta_y = \eta_x = 0$.

2. When $\eta_y = \eta_x = 0$, it may ordinarily be expected that
$\phi$ will be practically zero but it is not absolutely necessary that
such be the case.

3. When only one $\eta$ is zero, it is most likely that $\phi$ will be
small in value.

4. When $\phi$ takes the maximum value, $\eta_y = \eta_x = 1$.

5. When $\eta_y = \eta_x = 1$, $\phi$ takes the maximum value.

6. When one $\eta$ only is unity it is most likely that the value
of $\phi$ will not differ greatly from the maximum.

7. There is a close correspondence between the values of
$\phi$ and the $\eta$'s for data of all degrees of correlation.

Discussion of the Theorems. On substituting the relations
$n_{xy} = \frac{n_x n_y}{N}$ in the formula for $\eta_y$ and for $\eta_x$ it follows
immediately that $\bar{y}_1 = \bar{y}_2 = \bar{y}_3 = \bar{y}$ and hence that
$\eta_y = \eta_x = 0$ for $\phi = 0$.

In regard to theorem 2, it will now be shown that the nine
relations, $n_{xy} = \frac{n_x n_y}{N}$, which result for a nine-fold table when
$\phi = 0$, can be reduced to four independent relations. That is, if
there are four such relations the other five must hold true and the
value of $\phi$ is necessarily zero. In other words, the vanishing of
$\phi$ imposes four and only four restrictions or conditions on the
data of a 3 by 3 table.

For if $n_{11} = \frac{n_1 n_{:1}}{N}$, $n_{21} = \frac{n_2 n_{:1}}{N}$, $n_{12} = \frac{n_1 n_{:2}}{N}$, and
$n_{22} = \frac{n_2 n_{:2}}{N}$, it follows that $n_{31} = \frac{n_3 n_{:1}}{N}$. Let us substitute the
equivalents for $n_{11}$ and $n_{21}$ in the equation $n_{31} = n_{:1} - n_{11} - n_{21}$.
This substitution gives

$$n_{31} = n_{:1} - \frac{n_1 n_{:1}}{N} - \frac{n_2 n_{:1}}{N} = \frac{n_{:1}(N - n_1 - n_2)}{N} = \frac{n_3 n_{:1}}{N},$$

and similarly for the remaining relations.

The vanishing of $\eta_y$ implies the three relations

$$\frac{n_{12} + 2n_{13}}{n_1} = \frac{n_{22} + 2n_{23}}{n_2} = \frac{n_{32} + 2n_{33}}{n_3} = \frac{n_{:2} + 2n_{:3}}{N}.$$

It can be readily shown that only two of the three relations are
independent. That is, if the first two relations hold, the third
is necessarily true.

If $\eta_x$ as well as $\eta_y$ vanish the three additional relations

$$\frac{n_{21} + 2n_{31}}{n_{:1}} = \frac{n_{22} + 2n_{32}}{n_{:2}} = \frac{n_{23} + 2n_{33}}{n_{:3}} = \frac{n_2 + 2n_3}{N}$$

are implied. Here again only two of the three additional relations are
independent and of the six relations implied by the vanishing
of both $\eta_y$ and $\eta_x$ it is only a matter of algebraic detail to show
that only three are independent. That is, the vanishing of both
$\eta_x$ and $\eta_y$ imposes one less condition on the data than does the
vanishing of $\phi$. And hence it is not necessarily true that
$\phi = 0$ when $\eta_x = \eta_y = 0$.
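A numerical instance of this conclusion is furnished by the nine-fold table of Exercise 21 below. The sketch that follows is illustrative only: the three classes of each character are scored 0, 1, 2 (an assumption in keeping with the relations above), and the helper names are invented for the example.

```python
def phi_squared(table):
    """Sum of n_xy^2 / (n_x n_y), less 1 (no array total may be zero)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    s = sum(n * n / (rows[i] * cols[j])
            for i, row in enumerate(table) for j, n in enumerate(row))
    return s - 1.0


def eta_squared(table, scores):
    """Squared correlation ratio of the scored column character on the row arrays."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    n = sum(row_totals)
    grand_mean = sum(c * s for c, s in zip(col_totals, scores)) / n
    between = 0.0
    for row, n_x in zip(table, row_totals):
        array_mean = sum(f * s for f, s in zip(row, scores)) / n_x
        between += n_x * (array_mean - grand_mean) ** 2
    total = sum(c * (s - grand_mean) ** 2 for c, s in zip(col_totals, scores))
    return between / total


def transpose(table):
    """Interchange the two characters so the other correlation ratio can be found."""
    return [list(col) for col in zip(*table)]


table = [[2, 4, 2],
         [1, 1, 1],
         [2, 4, 2]]
scores = [0, 1, 2]                       # assumed class scores
eta_y_sq = eta_squared(table, scores)
eta_x_sq = eta_squared(transpose(table), scores)
phi_sq = phi_squared(table)
```

Both correlation ratios vanish, because every array has the same mean, while $\phi^2$ comes out as 2/135; the frequencies are not proportional to the array totals even though the means of the arrays do not vary.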

As to the maximum values for these constants, the
relations $\eta_y = \eta_x = 1$ require that there be but one non-vanishing
frequency in each array of either sense and hence the
condition for a maximum value for $\phi$ is satisfied. The converse
relations are evidently true.

For only one $\eta$ equal to unity, however, the data might be
arranged, for instance, in the form of the table,

a  0  0
0  0  0
0  b  c,

when $\phi^2$ would not have the maximum value.

If a large number of distributions were made up from the
same population and the values of $\eta$ and of $\phi$ computed for each
distribution, it would be found that in the long run a large value
of $\eta$ was associated with the larger values of $\phi$ and vice versa.
But to obtain a formula for the correlation of $\eta$ and $\phi$ is a
matter of considerable algebraic detail and the resulting formula is
so complicated that it is practically worthless*. For this reason
the algebraic discussion of Theorems 2, 3 and 6 is not given in
the complete form.

We have outlined the method of showing that the value of
$\eta$ for a non-quantitative distribution has a close connection to the
value of $\phi$; that is, that $\eta$ and $\phi$ are highly correlated, for such
data and hence the correlation ratio may be used to measure the
degree of correlation or association in the data with all the
assurance that attaches to the method of contingency. It is of
distinct practical advantage to have one coefficient or index of
correlation for all kinds of data and for that reason the coefficient
of contingency is not greatly used in practice.

* Compare Blakeman, "The Probable Error of the Coefficient of
Contingency" loc. cit.

Caution is necessary at one point, however, for data divided
into only a few classes does not convey the same amount of
information regarding the correlation between the characteristics as
does the more detailed material and hence not the same degree
of confidence can be placed in the computed value of any constant
derived from the less detailed table. For this reason
comparisons of the values of correlation measures between different forms
of distributions must be carefully made and due account taken of
the fact that for the small table the results do not warrant the
same degree of confidence as do the results from the finely
divided table.


Exercises.

18. Show that if $n_{11} = \frac{n_1 n_{:1}}{N}$, $n_{21} = \frac{n_2 n_{:1}}{N}$, $n_{12} = \frac{n_1 n_{:2}}{N}$
and $n_{22} = \frac{n_2 n_{:2}}{N}$ the remaining five relations of the same type hold.

19. Show that if $\eta_y = 0$ the vanishing of $\eta_x$ imposes only one
additional condition on the data.

20. Show that if $\eta_y = \eta_x = 0$, the frequencies of the nine-fold table
can be expressed in terms of the marginal sums and frequency of any
one sub-group.

21. Show that in the distribution

2  4  2
1  1  1
2  4  2

$\eta_y = \eta_x = 0$ and $\phi \neq 0$.

22. Construct a fictitious table having $\eta_y = 1$ and $\phi$ not having a
maximum value.

23. Investigate the relations between $\eta$ and $\phi$ for a 2 x 3 table.


APPENDIX  I. 

Introduction. The generalized frequency curves of Pearson
are so diverse in shape that a curve of this class can be
found to fit any ordinary statistical distribution. By the
following methods the fitting of a Pearson curve is reduced
almost entirely to a matter of routine substitution in formulas,
so that the practical statistician can make extended use of the
curves without great familiarity with their theory.

This discussion is designed both to present the working methods of
the generalized frequency curves and to give the statistician who has a
minimum of acquaintance with the higher mathematics some degree of
familiarity with the underlying theory. The demonstrations are, for the
most part, omitted. Many of the exercises have to do with the omitted
theorems and derivations.

In developing the theory of the generalized frequency curves
it is logical, as well as practically convenient, to start with the
normal curve and consider the general distribution as a
modification* of the normal type of distribution.

The Slope Property. The particular modification which
leads to the frequency curves of Pearson is obtained by generalizing
the slope condition of the normal curve.** The slope of a
curve at a given point is the tangent of the angle which the
line touching the curve at that point makes with the X-axis.
In the case of the normal curve, the ratio of the slope to the
ordinate is negatively equal to the abscissa of the point.

This slope property is generalized by taking the ratio
equal not to $-x$, but to $-\frac{x + a}{b + cx + dx^2}$ where a, b, c, d, are
constants. The slope of a curve is ordinarily denoted by the
symbol $\frac{dy}{dx}$.

* Compare Edgeworth, Jour. Roy. Stat. Soc. Also West, "On the
Translated Normal Curve," Ohio Journal of Science, Dec., 1915.

** First extensively treated by Pearson in the article "Skew
Variation in Homogeneous Material" Phil. Trans.

In this notation the generalized slope property is expressed
by the equation

$$\frac{1}{y}\frac{dy}{dx} = -\frac{x + a}{b + cx + dx^2}.$$

The Constants a, b, c, d. The statistical significance of
each of the constants, a, b, c, d, can be readily determined.

In Chapter IV, it is shown that the slope of a frequency
curve is zero at a mode. Since $\frac{dy}{dx}$; that is, the slope, is zero
when $x = -a$, the constant a determines the position of the
mode. The mode is therefore at a distance, $-a$, from the mean.
As explained in Chapter V, a is thus a measure of the skewness,
of the lack of symmetry of the distribution. For a symmetrical
distribution a is evidently 0.

When both c and d are zero the generalized slope equation
is merely the normal slope equation with x replaced by $\frac{x + a}{b}$.
This leads to the normal curve,

$$y = k \cdot e^{-\frac{(x + a)^2}{2b}},$$

where k is a constant.

Comparing this equation with the standard normal equation,

$$y = y_0 e^{-\frac{x^2}{2\sigma^2}},$$

we see that b equals $2\sigma^2$ multiplied by a constant.
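The passage from the slope equation to the normal curve can be checked by a rough numerical integration. The sketch below is an assumption-laden illustration: it takes the slope equation with the negative sign, $\frac{1}{y}\frac{dy}{dx} = -\frac{x+a}{b+cx+dx^2}$, integrates $\log y$ by the trapezoid rule, and uses arbitrary values of a and b; the function names are invented for the example.

```python
import math


def integrate_slope_equation(a, b, c, d, x_start, x_end, steps=2000, y_start=1.0):
    """Integrate (1/y) dy/dx = -(x + a) / (b + c x + d x^2) from x_start to x_end."""
    def log_slope(x):
        return -(x + a) / (b + c * x + d * x * x)

    h = (x_end - x_start) / steps
    log_y = math.log(y_start)
    for i in range(steps):
        x0 = x_start + i * h
        # trapezoid rule applied to d(log y)/dx over one small step
        log_y += 0.5 * h * (log_slope(x0) + log_slope(x0 + h))
    return math.exp(log_y)


def normal_form(x, a, b, k):
    """The curve y = k * exp(-(x + a)^2 / (2 b)) obtained when c = d = 0."""
    return k * math.exp(-((x + a) ** 2) / (2.0 * b))


# Arbitrary illustrative constants; the integration starts at the mode x = -a.
a, b = 0.5, 2.0
numerical = integrate_slope_equation(a, b, 0.0, 0.0, x_start=-a, x_end=1.5)
analytical = normal_form(1.5, a, b, k=1.0)
```

With c and d both zero the two ordinates agree, the common value here being $e^{-1}$. Giving c or d a value other than zero in the same routine shows at once how the curve departs from the normal form.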

The degree of symmetry of the curve is indicated by the
value of c as well as by the value of a. For, when x is positive,
the term cx is added in the denominator and when x is negative
it is subtracted. This tends to make the frequency curve steeper
to the left than to the right of the origin, and hence the curve
must extend farther to the right, that is, the curve must be skew.*

But it was seen in Chapter V that $\beta_1$ is the fundamental
measure of skewness. Therefore both a and c must contain $\beta_1$
as a factor.

When $x^2$ is small the constant d has little effect on the
slope, but for the extremities of the curve where x and hence
$dx^2$ is large the slope is reduced by a large value of d. It will
be seen that d depends largely on $\beta_2$.

* See page 57, Chapter V.

The Types of Curves. We may now discuss the distinct
types of curves that possess the slope properties of the
generalized slope equation. Distinct types of curves result according as
the denominator, $b + cx + dx^2$, has two distinct factors, two
coincident factors, or has no factors.
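The three cases can be told apart by the sign of the discriminant $c^2 - 4bd$ of the quadratic. The following sketch is illustrative only; the function name, the tolerance and the wording of the labels are assumptions, and the case d = 0, in which the denominator is no longer a quadratic, is set apart.

```python
def factor_case(b, c, d, tol=1e-12):
    """Classify the denominator b + c x + d x^2 by its real linear factors."""
    if d == 0:
        return "not a quadratic (d = 0)"
    discriminant = c * c - 4.0 * b * d
    if discriminant > tol:
        return "two distinct factors"
    if discriminant < -tol:
        return "no factors"
    return "two coincident factors"


# Illustrative denominators: x^2 - 1, (x + 1)^2 and x^2 + 1.
case_distinct = factor_case(-1.0, 0.0, 1.0)
case_coincident = factor_case(1.0, 2.0, 1.0)
case_none = factor_case(1.0, 0.0, 1.0)
```

A tolerance is used because coefficients found from data are seldom exact, so that strict equality of the discriminant to zero would hardly ever be met.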

With two distinct factors the slope equation can be written

$$\frac{1}{y}\frac{dy}{dx} = -\frac{x + a}{b + cx + dx^2} = k \cdot \frac{x + a}{(r_1 + x)(r_2 - x)},$$

where k is a constant.

By the usual mathematical methods we then have

$$y = y_0 (r_1 + x)^{\frac{k(a - r_1)}{r_1 + r_2}} \cdot (r_2 - x)^{-\frac{k(a + r_2)}{r_1 + r_2}} \qquad \text{(A)}$$

where $y_0$ is the constant of integration.

By a simple transformation and rearrangement, this equation
can be reduced to the form of Pearson's first type, namely

$$y = y_0 \left(1 + \frac{x}{a_1}\right)^{m_1} \left(1 - \frac{x}{a_2}\right)^{m_2}. \qquad \text{Type I.}$$

Exercises. 

1. Carry through in detail the necessary transformations to
determine the equation of Type I from equation (A).

2.  Perform  the  integrations  to  obtain  the  curve  of  Type  I. 

When $a_1$ and $a_2$ are equal it is readily shown that $m_1 = m_2$
and the equation takes the form of Type II:

$$y = y_0 \left(1 - \frac{x^2}{a^2}\right)^{m}. \qquad \text{Type II.}$$

When one root of the denominator $b + cx + dx^2$ is indefinitely
large, that is, when d is zero, we have, from the theory
of the exponential e, the third type:

$$y = y_0 \left(1 + \frac{x}{a}\right)^{\gamma a} e^{-\gamma x}. \qquad \text{Type III.}$$

This equation may be looked upon as that of Type I with $a_2$
indefinitely large.

The curves of Type III are especially serviceable because
the equations are simple in form and convenient for
computation. They are the most elementary skew curves.

By transforming expression (A), in a manner somewhat
different from that to obtain Type I, the form of Pearson's
sixth type is readily obtained. It is

$$y = y_0 (x - a)^{m_2} x^{-m_1}. \qquad \text{Type VI.}$$

Exercises. 

3.  Obtain  the  equation  of  Type  II  by  direct  integration  from  the 
differential  equation. 

4.  Compare  Type  II  with  the  normal  curve. 

5.  Obtain  Type  III  directly  by  integration. 

6.  Obtain  Type   III   from    (A). 

7.  Compare  the  shape  of  Type  III  with  that  of  the  normal  curve. 

8.  Obtain  the  equation   of   Type  VI   directly  from  the   differential 
equation. 

10.     Is  Type  VI  geometrically  distinct  from  Type  I? 

When two roots are indefinitely large we have the normal
curve:

$$y = y_0 e^{-\frac{x^2}{2\sigma^2}},$$

which is called simply "Normal" in Pearson's scheme of
classification.

With two coincident roots, the slope equation becomes

$$\frac{1}{y}\frac{dy}{dx} = k \cdot \frac{x + a}{(x + r)^2}.$$

This leads to the form

$$y = y_0 x^{-p} e^{-\frac{\gamma}{x}}, \qquad \text{Type V,}$$

which is Pearson's Type V.

Exercises. 

11.     Derive  in  detail  the  equation  of  Type  V. 


When the denominator of the slope equation cannot be
factored the integration is performed by writing

$$\frac{1}{y}\frac{dy}{dx} = -\frac{x + a}{b + cx + dx^2} = -\frac{x + \frac{c}{2d} + a - \frac{c}{2d}}{d\left[x^2 + \frac{c}{d}x + \frac{c^2}{4d^2} + \frac{b}{d} - \frac{c^2}{4d^2}\right]}.$$

This gives

$$y = y_0 \left(1 + \frac{x^2}{a^2}\right)^{-m} e^{-\nu \tan^{-1}\frac{x}{a}}, \qquad \text{Type IV,}$$

which is the form of Type IV.

Exercises. 

12.  Derive  in  detail  the  equation  of  Type  IV. 

13.  Derive  the   equation  of   Type   IV  by  transformation   from  the 
equation  of  Type  I. 

14.  Compare  the  form  of  the  equation  of  Type  IV  to  that  of  Type  III. 

If $\nu$ is zero in the immediately preceding equation we have
Pearson's Type VII.

$$y = y_0 \left(1 + \frac{x^2}{a^2}\right)^{-m}. \qquad \text{Type VII.}$$

The Intercepts. The intercepts made on the X-axis by
the various types of curves can now be examined. The
following theorem is fundamental in the theory of the intercepts of
Pearson's curves: an incommensurable power of a negative
number does not exist.

Let $-N$ denote any negative number and $(-N)^p = r(\cos p\pi +
\sqrt{-1}\,\sin p\pi)$ where $\sqrt{-1}$ is the square root of negative unity. Unless
p is an integer $\sin p\pi$ is not zero and hence $(-N)^p$ contains $\sqrt{-1}$ which
has no arithmetical value. Hence powers of $-N$ which are not integral
do not exist.
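The statement can be tried on particular numbers. The sketch below evaluates the expression $r(\cos p\pi + \sqrt{-1}\,\sin p\pi)$ directly, taking r as $N^p$; the function name and the trial values are assumptions made for the illustration.

```python
import math


def power_of_negative(n, p):
    """Real and imaginary parts of (-n)**p = n**p * (cos(p*pi) + i*sin(p*pi)), n > 0."""
    r = n ** p
    return r * math.cos(p * math.pi), r * math.sin(p * math.pi)


# An integral power: the part multiplying the square root of -1 vanishes.
real_integral, imag_integral = power_of_negative(4.0, 2)

# A power that is not integral: that part does not vanish.
real_fractional, imag_fractional = power_of_negative(4.0, 0.5)
```

For p = 2 the value is 16 with nothing (beyond rounding) attached to $\sqrt{-1}$; for p = 0.5 the whole value, 2, is attached to $\sqrt{-1}$, so that no ordinate can be plotted.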

In Type I the intercepts are $-a_1$ and $a_2$. Since $m_1$ and $m_2$ are
not integers, the curve stops at the X-axis and there are no points
below that axis. Indeed, there are no negative ordinates on any
of the curves.


In Type II the intercepts are of the same length and
numerically equal to a.

In Type III one intercept is $-a$ and the other is indefinitely
large.

In the case of the normal curve both intercepts are
indefinitely large.

In Types IV and VII there are no intercepts.

In Type V one intercept is at the origin and the
other is indefinitely large.

In Type VI both intercepts are positive or both are negative.

Ordinarily the type of curve selected should have intercepts
harmonizing with the natural limits of the range of the data.
For instance, data necessarily limited in either direction should
be smoothed with a curve correspondingly limited. However
nearly all the curves are practically limited in range because the
ordinates soon become negligible, so that the matter is not one
of great importance; though a somewhat better fit is likely to be
obtained with a curve limited in accordance with the data.

Exercises. 

15.  Of  what  types  is  the  normal  curve  a  limiting  curve? 

16.  Distinguish  between   a   curve   with   indefinitely   large   intercepts 
and  a  curve  with  imaginary  or  non-existent  intercepts. 

17.  Show  that  there  are  indefinitely  more  curves  of  Types  I,  VI  and 
IV  than  of  Types  III,  V,  II  or  VII,  or  of  the  normal  curve. 

18.  Show  how  Type  I  can  be  said  algebraically  to  include  Type  IV. 

19.  Show  that  Types  I  and  VI  are  not  fundamentally  distinct. 

20.  Show  that  by  taking  all  combinations  of  sign  into  account  there 
are  three  distinct  classes  of  curve  under  Type  I. 

21.  Show  that  there  are  two  sub-classes  under  Type  II  according 
as  the  exponent  m  is  positive  or  negative. 

22.  Show  that  there  are  two  classes  under  Type  III. 

23.  Is there more than one general form of curve under Type IV?
Under Type V?

24.  Discuss the curves of Type VI as to the existence of
sub-classes within the Type.

25.  What  types  of  these  curves  have  asymptotes? 

26.  Do  all  the  curves  have  a  mode? 

27.  Find  the  points  of  inflexion  for  each  type. 

The Criterion K. Since the separation into types depends
primarily on the nature of the roots of the quadratic,
$b + cx + dx^2$, the discriminant of this quadratic constitutes a
criterion of the type of curve which fits the distribution. The
values of $a$, $b$, $c$, and $d$ are first determined by the method of
moments, and the discriminant is then expressed in terms of the
computed expressions for $b$, $c$, and $d$.

The formula for K, the discriminant obtained in this way, is

$$K = \frac{\beta_1(\beta_2+3)^2}{4(2\beta_2 - 3\beta_1 - 6)(4\beta_2 - 3\beta_1)}.$$

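This formula for K is derived as follows:

The differential equation

$$\frac{1}{y}\frac{dy}{dx} = -\frac{x+a}{b+cx+dx^2}$$

may be written $(b + cx + dx^2)\,dy/dx = -y(x+a)$. Multiplying each side by
$x^n$ and integrating, we have

$$\int x^n(b+cx+dx^2)\,dy = -\int y(x+a)x^n\,dx.$$

On integrating the left side by parts,

$$x^n(b+cx+dx^2)\,y - nb\int x^{n-1}y\,dx - (n+1)c\int x^n y\,dx - (n+2)d\int x^{n+1}y\,dx = -\int y\,x^{n+1}\,dx - a\int y\,x^n\,dx.$$

With the usual notation, where $\mu'_n = \int x^n y\,dx$,

$$x^n(b+cx+dx^2)\,y - nb\,\mu'_{n-1} - (n+1)c\,\mu'_n - (n+2)d\,\mu'_{n+1} = -\mu'_{n+1} - a\,\mu'_n.$$

If $y$ is very small at the ends of the range the first expression vanishes
and the moment equation connects the three moments $\mu'_{n-1}$, $\mu'_n$,
$\mu'_{n+1}$.

On rearranging this equation we have

$$a\,\mu'_n - nb\,\mu'_{n-1} - (n+1)c\,\mu'_n - (n+2)d\,\mu'_{n+1} = -\mu'_{n+1}.$$

Since the moment $\mu'_0 = 1$ and, if the mean is taken as origin,
$\mu_1 = 0$, we have for $n = 0, 1, 2, 3$ respectively the four equations:

$$a - c = 0,$$
$$-b - 3d\mu_2 = -\mu_2,$$
$$a\mu_2 - 3c\mu_2 - 4d\mu_3 = -\mu_3,$$
$$a\mu_3 - 3b\mu_2 - 4c\mu_3 - 5d\mu_4 = -\mu_4.$$

On solving this set of equations and substituting in the differential
or slope equation, we have

$$\frac{1}{y}\frac{dy}{dx} = -\frac{x + \dfrac{\mu_3(\mu_4+3\mu_2^2)}{\Delta}}{\dfrac{\mu_2(4\mu_2\mu_4-3\mu_3^2)}{\Delta} + \dfrac{\mu_3(\mu_4+3\mu_2^2)}{\Delta}\,x + \dfrac{2\mu_2\mu_4-3\mu_3^2-6\mu_2^3}{\Delta}\,x^2}, \qquad \Delta = 10\mu_2\mu_4 - 12\mu_3^2 - 18\mu_2^3.$$

In terms of $\beta_1$ and $\beta_2$ this becomes

$$\frac{1}{y}\frac{dy}{dx} = -\frac{x + \dfrac{\sqrt{\mu_2\beta_1}\,(\beta_2+3)}{2(5\beta_2-6\beta_1-9)}}{\dfrac{\mu_2(4\beta_2-3\beta_1)}{2(5\beta_2-6\beta_1-9)} + \dfrac{\sqrt{\mu_2\beta_1}\,(\beta_2+3)}{2(5\beta_2-6\beta_1-9)}\,x + \dfrac{2\beta_2-3\beta_1-6}{2(5\beta_2-6\beta_1-9)}\,x^2},$$

where $\sqrt{\beta_1}$ is taken with the sign of $\mu_3$.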
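The solution of the four moment equations can be verified by arithmetic. The sketch below is in Python and is not part of the original text. It assumes the equations for $n = 0, 1, 2, 3$ read $a - c = 0$, $-b - 3d\mu_2 = -\mu_2$, $a\mu_2 - 3c\mu_2 - 4d\mu_3 = -\mu_3$ and $a\mu_3 - 3b\mu_2 - 4c\mu_3 - 5d\mu_4 = -\mu_4$, with the mean as origin; the function names and the trial moments are illustrative assumptions.

```python
def slope_constants(mu2, mu3, mu4):
    """Constants a, b, c, d of (1/y) dy/dx = -(x + a)/(b + c x + d x^2),
    solved from the four moment equations (mean taken as origin)."""
    delta = 10 * mu2 * mu4 - 12 * mu3 ** 2 - 18 * mu2 ** 3
    d = (2 * mu2 * mu4 - 3 * mu3 ** 2 - 6 * mu2 ** 3) / delta
    b = mu2 * (4 * mu2 * mu4 - 3 * mu3 ** 2) / delta
    c = mu3 * (mu4 + 3 * mu2 ** 2) / delta
    return c, b, c, d  # the first equation gives a = c

def moment_residuals(a, b, c, d, mu2, mu3, mu4):
    """Left side minus right side of each moment equation, n = 0, 1, 2, 3."""
    return [
        a - c,
        -b - 3 * d * mu2 + mu2,
        a * mu2 - 3 * c * mu2 - 4 * d * mu3 + mu3,
        a * mu3 - 3 * b * mu2 - 4 * c * mu3 - 5 * d * mu4 + mu4,
    ]

# Trial moments chosen only for illustration.
a, b, c, d = slope_constants(2.0, 1.0, 15.0)
print(a, b, c, d)
print(moment_residuals(a, b, c, d, 2.0, 1.0, 15.0))
```

Every residual should vanish to rounding error, which confirms that the closed expressions satisfy the moment equations for the trial values.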



The discriminant of the quadratic denominator is the required
criterion, K. It is easily shown that

$$K = \frac{c^2}{4bd} = \frac{\beta_1(\beta_2+3)^2}{4(2\beta_2-3\beta_1-6)(4\beta_2-3\beta_1)}.$$

For, the quadratic expression, $dx^2 + cx + b$, may be written

$$d\left\{x + \frac{c - \sqrt{c^2-4bd}}{2d}\right\}\left\{x + \frac{c + \sqrt{c^2-4bd}}{2d}\right\}.$$

Hence the character of the two factors depends on the value of the
quantity $(c^2 - 4bd)$. When this is zero the two factors are equal; when
it is negative there are no real factors, etc. Writing $(c^2 - 4bd)$ in the
form $(c^2/4bd - 1)\,4bd$ we have, if $K = c^2/4bd$, the following classes of
factors according to values of K:

If $K < 0$; that is, if K is negative, the factors are unequal, because
a negative sign for $c^2/4bd$ must be due to unlike signs for $b$ and $d$, and
hence the product $4bd$ must be negative. That is, $c^2 - 4bd$ is positive for
$K < 0$.

If K is positive there are two cases, according as K is greater or less
than unity. If K lies between zero and positive unity, $(c^2 - 4bd)$ is
negative and consequently there are no real factors. If $K > 1$, $(c^2 - 4bd)$ is
again positive and the factors are unequal, etc.
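The connection between K and the factors can be illustrated with a few trial quadratics. This is a sketch in Python, not part of the original text; the name `factor_class` is an illustrative assumption, and $b$ and $d$ are assumed to be different from zero.

```python
def factor_class(b, c, d):
    """Return K = c^2/(4bd) and the class of factors of d x^2 + c x + b."""
    K = c * c / (4.0 * b * d)
    disc = c * c - 4.0 * b * d  # equal to (K - 1) * 4bd
    if disc > 0:
        kind = "real and unequal"
    elif disc == 0:
        kind = "real and equal"
    else:
        kind = "no real factors"
    return K, kind

for b, c, d in [(-1.0, 1.0, 1.0), (1.0, 1.0, 1.0), (1.0, 2.0, 1.0), (1.0, 3.0, 1.0)]:
    print((b, c, d), factor_class(b, c, d))
```

The four trial cases have K negative, between zero and unity, equal to unity, and greater than unity respectively.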

The Value of K and the Types of Curve. The following
table gives the types of curves corresponding to the different
values of K.

$K < 0$, i. e. negative                          Type I.
$K = 0$, $\beta_1 = 0$, $\beta_2 = 3$            Normal Curve.
$K = 0$, $\beta_1 = 0$, $\beta_2 > 3$            Type VII.
$K = 0$, $\beta_1 = 0$, $\beta_2 < 3$            Type II.
$0 < K < 1$                                      Type IV.
$K = 1$                                          Type V.
$K > 1$, but not indefinitely large              Type VI.
$K > 1$ and indefinitely large                   Type III.

It is to be noted that the type of curve for any given
statistical distribution can now be determined by strictly arithmetic
methods.
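The arithmetic determination of the type can be sketched as follows, in Python (not part of the original text). The sketch assumes the usual definitions $\beta_1 = \mu_3^2/\mu_2^3$ and $\beta_2 = \mu_4/\mu_2^2$, and the standard assignment of types to values of K (Type VII for $K = 0$ with $\beta_2 > 3$, Type VI for $K > 1$). The function names and the tolerance `tol` are illustrative assumptions.

```python
import math

def criterion_K(beta1, beta2):
    """K = beta1 (beta2 + 3)^2 / [4 (2 beta2 - 3 beta1 - 6)(4 beta2 - 3 beta1)]."""
    if beta1 == 0:
        return 0.0  # symmetrical distribution
    den = 4.0 * (2 * beta2 - 3 * beta1 - 6) * (4 * beta2 - 3 * beta1)
    if den == 0:
        return math.inf  # K indefinitely large
    return beta1 * (beta2 + 3) ** 2 / den

def pearson_type(beta1, beta2, tol=1e-9):
    """Type of curve corresponding to the value of K."""
    K = criterion_K(beta1, beta2)
    if K == 0:
        if abs(beta2 - 3) <= tol:
            return "Normal Curve"
        return "Type II" if beta2 < 3 else "Type VII"
    if math.isinf(K):
        return "Type III"
    if K < 0:
        return "Type I"
    if abs(K - 1) <= tol:
        return "Type V"
    return "Type IV" if K < 1 else "Type VI"

print(pearson_type(0.5, 3.0))  # K negative
print(pearson_type(0.5, 5.0))  # K between zero and unity
```

In practice K computed from data will seldom be exactly 0 or 1, so the choice among neighboring types is a matter of judgment; the tolerance here is only a convenience.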

The only restriction on the generality of the theory of the criterion
K is that the quantity $x^n(b + cx + dx^2)\,y$ must vanish at both ends of the
range. This condition marks the pairs of values of $\beta_1$ and $\beta_2$ for which
no curve of the generalized differential equation can be found. The
limiting  values  of  ft  and  ft  are  ft  >  f  ft  and  ft  >  ft/8  -f  9/2  (see 
Exercises  29  and  30  below). 




Exercises. 

28.     Read  the   explanation   to   Tables   XXXV-XLVI   in  "Tables    for 
Statisticians  and   Biometricians." 
29.*     Derive  the    formulas 


|8n(odd)  =  (n-f  1) 

where $\alpha = (2\beta_2 - 3\beta_1 - 6)/(\beta_2 + 3)$.

30.  From the computation formulas for Type II, prove that
$m$ is negative when $\beta_2 < 1.8$.

31.  Prove from the working formulas of Type I that Type I includes
three sub-classes according to the signs of $m_1$ and $m_2$. Derive the
criterion curve,

$$\frac{\beta_1(8\beta_2 - 9\beta_1 - 12)}{4\beta_2 - 3\beta_1} = \frac{(10\beta_2 - 12\beta_1 - 18)^2}{(\beta_2 + 3)^2}.*$$

32.  Prove that $\beta_2 > \beta_1$.

33.  Prove the relation $\beta_2 > \tfrac{15}{8}\beta_1 + \tfrac{9}{2}$.

34.  Show that a large value of $\beta_2$ for the curves derived from the
generalized differential equation denotes a comparatively flat-topped
curve.

35.  Show that for the normal curve with $a = 0$, we have $b = \sigma^2$,
$c = d = 0$, $\mu_4 = 3\mu_2^2$.
The Computation Formulas. The computation formulas
for the several types of Pearson's frequency curves are derived
in accordance with the method of moments. For each type as
many moment equations are written as there are constants in the
equation of a curve of the type. In some of the type equations,
as in Type I where $a_1/m_1 = a_2/m_2$, the constants are connected
by equations so that the number of moment equations is reduced.
The moment equations are the result of equating the theoretical
moments of the curve obtained by integration to the moments
computed directly from the data.

It might be expected that the differential equation in terms of $\mu_2$
and the $\beta$'s would be integrated to give the equations directly, but the
present process is more convenient. The chief purpose, therefore, of
the slope or differential equation is the determination of the type
forms of the equations. After the algebraic forms of the equations are
determined each type is worked out without making use of its connection
either with the slope equation or with other type forms.


*  See  page  Ixiii.  of  "Tables  for  Statisticians  and  Biometricians." 



The expression $\Gamma(p)$, called the gamma function, occurs in
the following formulas. This function is defined by the relation

$$\Gamma(p) = (p-1)\,\Gamma(p-1).$$

If $p$ is an integer, $\Gamma(p) = (p-1)!$.
If $p$ is not an integer, $\Gamma(p) = (p-1)(p-2)\cdots P\;\Gamma(P)$,
where $P$ is the remainder after subtracting a
sufficient number of 1's to bring $p$ down to between 2 and 1 in
value. The values of $\Gamma(P)$ are given in Table XXXI of
"Tables."
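The reduction can be sketched in Python (not part of the original text). The argument `table` stands in for a lookup in a printed table of the gamma function between 1 and 2; using the library function `math.gamma` in its place, and the name `gamma_by_reduction`, are assumptions of this sketch.

```python
import math

def gamma_by_reduction(p, table=math.gamma):
    """Gamma(p) for p >= 1 by repeated use of Gamma(p) = (p - 1) Gamma(p - 1).

    The argument is brought down to a value P between 1 and 2, and Gamma(P)
    is then taken from `table`.
    """
    if p < 1:
        raise ValueError("this sketch assumes p >= 1")
    product = 1.0
    while p > 2:
        p -= 1.0        # step down by unity ...
        product *= p    # ... collecting the factors (p-1), (p-2), ..., P
    return product * table(p)

print(gamma_by_reduction(5.0))   # integral p: factorial of p - 1
print(gamma_by_reduction(3.5))   # non-integral p: 2.5 * 1.5 * Gamma(1.5)
```

For $p = 3.5$ the factors collected are 2.5 and 1.5, and the tabled value required is $\Gamma(1.5)$.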

The probable errors of K as well as of $\beta_1$ and $\beta_2$ are given
in "Tables."

The derivation of the following computation formulas,
except the moment formulas, is not possible without an extensive
acquaintance with the calculus.*

After the constants in the equation are computed the
smoothed frequencies are obtained by computing the areas under
the curve and between the bounding ordinates. Thus the
frequency of the first class is the area between the ordinates $x = \tfrac12$
and $x = 1\tfrac12$. Simpson's quadrature formula is ordinarily used
for finding the class areas. According to this formula the area
is

$$\tfrac16\left\{y_{x-\frac12} + 4y_x + y_{x+\frac12}\right\},$$

where $y_{x-\frac12}$ and $y_{x+\frac12}$
are the bounding ordinates and $y_x$ is the mid-ordinate of the class.
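The use of Simpson's formula for the class areas can be sketched in Python (not part of the original text). A class interval of unity is assumed, and the normal curve with trial values of $N$ and $\sigma$ serves only as an example; the function names are illustrative.

```python
import math

def class_area(curve, x):
    """Area of the class of unit width centred at x, by Simpson's formula:
    (1/6) [y(x - 1/2) + 4 y(x) + y(x + 1/2)]."""
    return (curve(x - 0.5) + 4.0 * curve(x) + curve(x + 0.5)) / 6.0

def normal_curve(N, sigma):
    """Ordinate of the normal curve with total frequency N and origin at the mean."""
    def y(x):
        return N / (math.sqrt(2.0 * math.pi) * sigma) * math.exp(-x * x / (2.0 * sigma * sigma))
    return y

curve = normal_curve(N=1000.0, sigma=2.0)
smoothed = [class_area(curve, x) for x in range(-10, 11)]
print(round(sum(smoothed), 3))  # should fall very near the total frequency N
```

The formula is exact for any curve of the third degree or lower within the class, and the sum of the smoothed frequencies serves as a check on the work.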

Formulas for the Moments.

$$S_2 = d.$$

$$\nu_2 = 2S_3 - d(1+d).$$

$$\nu_3 = 6S_4 - 3\nu_2(1+d) - d(1+d)(2+d).$$

$$\nu_4 = 24S_5 - 2\nu_3\{2(1+d)+1\} - \nu_2\{6(1+d)(2+d)-1\} - d(1+d)(2+d)(3+d).$$


—  3)8,  —  6 
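The moment formulas by summation can be checked against moments computed directly. The sketch below is in Python and is not part of the original text. It assumes that the classes are numbered $x = 1, 2, \ldots$, that each column of sums is formed by cumulating the preceding column from the last class upward, and that $S_2, S_3, S_4, S_5$ are the totals of the successive columns divided by $N$. These conventions, and the function names, are assumptions of the sketch.

```python
from itertools import accumulate

def summation_moments(freqs):
    """d and the moments nu2, nu3, nu4 about the mean, by successive summation."""
    N = float(sum(freqs))
    column = list(freqs)
    sums = []
    for _ in range(4):  # totals of four successive columns give S2, S3, S4, S5
        column = list(accumulate(reversed(column)))[::-1]
        sums.append(sum(column) / N)
    S2, S3, S4, S5 = sums
    d = S2
    nu2 = 2 * S3 - d * (1 + d)
    nu3 = 6 * S4 - 3 * nu2 * (1 + d) - d * (1 + d) * (2 + d)
    nu4 = (24 * S5 - 2 * nu3 * (2 * (1 + d) + 1)
           - nu2 * (6 * (1 + d) * (2 + d) - 1)
           - d * (1 + d) * (2 + d) * (3 + d))
    return d, nu2, nu3, nu4

def direct_moments(freqs):
    """The same quantities computed directly from deviations about the mean."""
    N = float(sum(freqs))
    xs = range(1, len(freqs) + 1)
    d = sum(x * f for x, f in zip(xs, freqs)) / N
    return (d,) + tuple(sum(f * (x - d) ** k for x, f in zip(xs, freqs)) / N
                        for k in (2, 3, 4))

freqs = [1, 4, 6, 4, 1]  # trial frequencies
print(summation_moments(freqs))
print(direct_moments(freqs))
```

The two lines of output should agree. The moments so found are unadjusted for grouping.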


*  See Elderton, "Frequency Curves and Correlation," C. & E. Layton,
for a thorough discussion of the derivations.



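The computation formulas for Type I are as follows.
The equation is,

$$y = y_0\left(1 + \frac{x}{a_1}\right)^{m_1}\left(1 - \frac{x}{a_2}\right)^{m_2},$$

where $a_1/m_1 = a_2/m_2$.
We have

$$r = \frac{6(\beta_2 - \beta_1 - 1)}{6 + 3\beta_1 - 2\beta_2}.$$

$m_2$ and $m_1$ are given by the formulas

$$\tfrac12(r-2) \pm \tfrac12\,r(r+2)\sqrt{\frac{\beta_1}{\beta_1(r+2)^2 + 16(r+1)}}.$$

The constant $m_1$ is taken with the negative root when $\mu_3$ is positive
and with the positive root when $\mu_3$ is negative.

$a_1$ and $a_2$ can be found from the relations

$$a_1 + a_2 = \tfrac12\sqrt{\mu_2}\,\sqrt{\beta_1(r+2)^2 + 16(r+1)}, \qquad a_1/m_1 = a_2/m_2.$$

$$y_0 = \frac{N}{a_1 + a_2} \times \frac{m_1^{m_1}\,m_2^{m_2}}{(m_1+m_2)^{m_1+m_2}} \cdot \frac{\Gamma(m_1+m_2+2)}{\Gamma(m_1+1)\,\Gamma(m_2+1)}.$$

The skewness is

$$\tfrac12\sqrt{\beta_1}\;\frac{r+2}{r-2}.$$

$$\text{Mode} = \text{mean} - \frac12\,\frac{\mu_3}{\mu_2}\,\frac{r+2}{r-2}.$$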
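The computation for Type I can be sketched in Python (not part of the original text). The sketch assumes the standard forms $r = 6(\beta_2 - \beta_1 - 1)/(6 + 3\beta_1 - 2\beta_2)$ and $a_1 + a_2 = \tfrac12\sqrt{\mu_2\{\beta_1(r+2)^2 + 16(r+1)\}}$, and it treats only the case in which $m_1$ and $m_2$ are both positive. The function name and the trial moments are illustrative assumptions.

```python
import math

def type_one_constants(N, mu2, mu3, mu4):
    """Constants of y = y0 (1 + x/a1)^m1 (1 - x/a2)^m2 from the moments about the mean."""
    beta1 = mu3 ** 2 / mu2 ** 3
    beta2 = mu4 / mu2 ** 2
    r = 6 * (beta2 - beta1 - 1) / (6 + 3 * beta1 - 2 * beta2)
    eps = beta1 * (r + 2) ** 2 + 16 * (r + 1)
    half_diff = 0.5 * r * (r + 2) * math.sqrt(beta1 / eps)
    low, high = 0.5 * (r - 2) - half_diff, 0.5 * (r - 2) + half_diff
    # m1 takes the negative root when mu3 is positive, and conversely.
    m1, m2 = (low, high) if mu3 >= 0 else (high, low)
    if m1 <= 0 or m2 <= 0:
        raise ValueError("this sketch treats only m1 > 0 and m2 > 0")
    span = 0.5 * math.sqrt(mu2 * eps)          # a1 + a2
    a1 = span * m1 / (m1 + m2)                 # from a1/m1 = a2/m2
    a2 = span * m2 / (m1 + m2)
    log_y0 = (math.log(N / span)
              + m1 * math.log(m1) + m2 * math.log(m2)
              - (m1 + m2) * math.log(m1 + m2)
              + math.lgamma(m1 + m2 + 2) - math.lgamma(m1 + 1) - math.lgamma(m2 + 1))
    return {"r": r, "m1": m1, "m2": m2, "a1": a1, "a2": a2, "y0": math.exp(log_y0)}

# Trial moments: those of a curve with m1 = 2, m2 = 4 on a range of unit length.
mu2 = 15.0 / 576.0
mu3 = 60.0 / 46080.0
mu4 = (4266.0 / 1650.0) * mu2 ** 2
print(type_one_constants(1.0, mu2, mu3, mu4))
```

The logarithm of $y_0$ is computed first because the gamma functions become very large for even moderate exponents.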


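The formulas for Type II are as follows. The equation
for this type is

$$y = y_0\left(1 - \frac{x^2}{a^2}\right)^{m}.$$

The formulas are

$$m = \frac{5\beta_2 - 9}{2(3 - \beta_2)},$$

$$a^2 = \frac{2\mu_2\beta_2}{3 - \beta_2},$$

$$y_0 = \frac{N}{a\,2^{2m+1}}\cdot\frac{\Gamma(2m+2)}{\{\Gamma(m+1)\}^2}.$$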

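Type III. The equation is

$$y = y_0\left(1 + \frac{x}{a}\right)^{\gamma a} e^{-\gamma x}.$$

The formulas are,

$$\gamma = \frac{2\mu_2}{\mu_3}, \qquad a = \frac{2\mu_2^2}{\mu_3} - \frac{\mu_3}{2\mu_2},$$

$$y_0 = \frac{N}{a}\cdot\frac{p^{\,p+1}}{e^{p}\,\Gamma(p+1)}, \quad\text{where } p = \gamma a.$$

$$\text{Mode} = \text{mean} - \frac{1}{\gamma}.$$

$$\text{Skewness} = \frac{1}{\sigma\gamma}.$$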
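The computation for Type III can be sketched in Python (not part of the original text). The sketch assumes the standard relations $\gamma = 2\mu_2/\mu_3$ and $p = \gamma a = 4/\beta_1 - 1$, with $\mu_3$ positive; the function name and the trial moments are illustrative assumptions.

```python
import math

def type_three_constants(N, mu2, mu3):
    """Constants of y = y0 (1 + x/a)^(gamma a) e^(-gamma x), origin at the mode."""
    gam = 2.0 * mu2 / mu3
    p = 4.0 * mu2 ** 3 / mu3 ** 2 - 1.0      # p = gamma * a = 4/beta1 - 1
    a = p / gam
    # y0 = (N/a) p^(p+1) / (e^p Gamma(p+1)), computed through logarithms.
    log_y0 = math.log(N / a) + (p + 1.0) * math.log(p) - p - math.lgamma(p + 1.0)
    return {
        "gamma": gam,
        "p": p,
        "a": a,
        "y0": math.exp(log_y0),
        "mean_minus_mode": 1.0 / gam,
        "skewness": 1.0 / (math.sqrt(mu2) * gam),
    }

# Trial moments: mu2 = 1.25 and mu3 = 1.25 correspond to p = 4 and gamma = 2.
print(type_three_constants(1.0, 1.25, 1.25))
```

Only the second and third moments enter, since a curve of Type III has but three constants besides the position of the mean.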
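Type IV. The equation is

$$y = y_0\left(1 + \frac{x^2}{a^2}\right)^{-m} e^{-\nu\tan^{-1}(x/a)}.$$

The formulas are:

$$r = \frac{6(\beta_2 - \beta_1 - 1)}{2\beta_2 - 3\beta_1 - 6}, \qquad m = \tfrac12(r+2),$$

$$\nu = \frac{r(r-2)\sqrt{\beta_1}}{\sqrt{16(r-1) - \beta_1(r-2)^2}}, \text{ with sign opposite to that of } \mu_3,$$

$$a = \tfrac14\sqrt{\mu_2\{16(r-1) - \beta_1(r-2)^2\}},$$

$$y_0 = \frac{N}{a}\sqrt{\frac{r}{2\pi}}\cdot\frac{e^{\frac{\cos^2\phi}{3r} - \frac{1}{12r} - \nu\phi}}{(\cos\phi)^{r+1}} \text{ approximately, where } \tan\phi = \frac{\nu}{r}.$$

$$\text{Origin} = \text{mean} + \frac{\nu a}{r}.$$

$$\text{Mode} = \text{mean} - \frac12\,\frac{\mu_3}{\mu_2}\,\frac{r-2}{r+2}.$$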

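Type V. The equation is

$$y = y_0\,x^{-p}\,e^{-\gamma/x}.$$

The formulas are:

$$p = 4 + \frac{8 + 4\sqrt{4 + \beta_1}}{\beta_1},$$

$$\gamma = (p-2)\sqrt{\mu_2(p-3)}, \text{ with sign same as that of } \mu_3,$$

$$y_0 = \frac{N\gamma^{\,p-1}}{\Gamma(p-1)},$$

$$\text{Sk.} = \frac{2\sqrt{p-3}}{p},$$

$$\text{Origin} = \text{mean} - \frac{\gamma}{p-2},$$

$$\text{Mode} = \text{mean} - \frac{2\gamma}{p(p-2)}.$$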
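The computation for Type V can be sketched in Python (not part of the original text). The sketch assumes the standard form $p = 4 + \{8 + 4\sqrt{4+\beta_1}\}/\beta_1$ and uses the numerical value of $\gamma$ in computing $y_0$; the function name and the trial moments are illustrative assumptions.

```python
import math

def type_five_constants(N, mu2, mu3):
    """Constants of y = y0 x^(-p) e^(-gamma/x) from the moments about the mean."""
    beta1 = mu3 ** 2 / mu2 ** 3
    p = 4.0 + (8.0 + 4.0 * math.sqrt(4.0 + beta1)) / beta1
    # gamma takes the same sign as mu3.
    gam = math.copysign((p - 2.0) * math.sqrt(mu2 * (p - 3.0)), mu3)
    # y0 = N gamma^(p-1) / Gamma(p-1), computed through logarithms.
    log_y0 = math.log(N) + (p - 1.0) * math.log(abs(gam)) - math.lgamma(p - 1.0)
    return {
        "p": p,
        "gamma": gam,
        "y0": math.exp(log_y0),
        "skewness": 2.0 * math.sqrt(p - 3.0) / p,
        "mean_minus_origin": gam / (p - 2.0),
        "mean_minus_mode": 2.0 * gam / (p * (p - 2.0)),
    }

# Trial moments: mu2 = 0.05 and mu3 = 0.025 correspond to p = 8 and gamma = 3.
print(type_five_constants(1.0, 0.05, 0.025))
```

As with Type III, only the second and third moments are required.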


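Type VI. The equation is

$$y = y_0\,(x - a)^{q_2}\,x^{-q_1}.$$

The formulas are:

$$r = \frac{6(\beta_2 - \beta_1 - 1)}{6 + 3\beta_1 - 2\beta_2}.$$

$q_2$ and $-q_1$ are given by

$$\tfrac12(r-2) \pm \tfrac12\,r(r+2)\sqrt{\frac{\beta_1}{\beta_1(r+2)^2 + 16(r+1)}}.$$

$$a = \tfrac12\sqrt{\mu_2}\,\sqrt{\beta_1(r+2)^2 + 16(r+1)},$$

$$y_0 = \frac{N a^{\,q_1 - q_2 - 1}\,\Gamma(q_1)}{\Gamma(q_1 - q_2 - 1)\,\Gamma(q_2 + 1)},$$

$$\text{Origin} = \text{mean} - \frac{a(q_1 - 1)}{q_1 - q_2 - 2},$$

$$\text{Mode} = \text{mean} - \frac12\,\frac{\mu_3}{\mu_2}\,\frac{r+2}{r-2}.$$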

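Type VII. The equation is:

$$y = y_0\left(1 + \frac{x^2}{a^2}\right)^{-m}.$$

The formulas are:

$$m = \frac{5\beta_2 - 9}{2(\beta_2 - 3)},$$

$$a^2 = \frac{2\mu_2\beta_2}{\beta_2 - 3},$$

$$y_0 = \frac{N\,\Gamma(m)}{a\sqrt{\pi}\,\Gamma(m - \tfrac12)}.$$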


Normal Curve. The equation, as was proved in Chapter
VI, is

$$y = \frac{N}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{x^2}{2\sigma^2}},$$

and the curve was discussed in that chapter.


APPENDIX  II 

BLAKEMAN,  J. 

"On  the  Tests  for  Linearity  of  Regression  in  Frequency  Distribu- 
tions", Biometrika,  Vol.  IV,  pp.  332  et  seq. 

BLAKEMAN,  J.  and  PEARSON,  K. 

"On  the  Probable  Error  of  Mean  Square  Contingency",  Biometrika, 
Vol.  V ,  pp.  191  et  seq. 

BOWLEY, A. L.

"Measurements of Groups and Series", C. and E. Layton, 1903.
"Relation  between  the  Accuracy  of  an  Average  and  that  of  its  Con- 
stituent Parts",  Jour.  Roy.  Stat.  Soc.,  Dec.  1897. 
"The  Measurement  of  the  Accuracy  of  an  Average",  Jour.  Roy.  Stat. 
Soc.,  Dec.,  1911. 

BRAVAIS, A.

"Analyse mathématique sur les probabilités des erreurs de situation
d'un point", Mémoires présentés par divers savants à l'Académie
Royale des Sciences de l'Institut de France, sciences mathématiques
et physiques, IIe série, t. IX, 1846, p. 255.

BROWN,  GREENWOOD  and  WOOD. 

"A  Study  of  Index  Correlation",  Jour.  Roy.  Stat.  Soc.,  Feb.  1914. 

CAVE,  BEATRICE  and  PEARSON,  K. 

"Numerical  Illustrations  of  the  Variate-Difference  Correlation 
Method",  Biometrika,  Vol.  X,  pp.  340  et  seq. 

EDGEWORTH, F. Y.

"On the Method of Least Squares", Phil. Mag., Vol. XVI, Ser. 5,
1883, pp. 360 et seq.

"On Theory of Errors of Observation and the First Principles of
Statistics", Camb. Phil. Trans., Vol. XIV, pp. 138 et seq.

"Problems in Probability", Phil. Mag., Vol. XXII, Ser. 5, 1886,
pp. 374 et seq.

"On a New Method of reducing Observations relating to several
Quantities", Phil. Mag., Vol. XXIV, Ser. 5, 1887, pp. 222 et seq.
and Vol. XXV, 1888, pp. 184 et seq.

"On Correlated Averages", Phil. Mag., Vol. XXXIV, Ser. 5, 1892,
pp. 190 et seq.

"The Asymmetrical Probability Curve", Phil. Mag., Vol. XLI, 1896,
pp. 90 et seq.

"Representation of Statistics by Mathematical Formulas", Jour. Roy.
Stat. Soc., Dec. 1898; Sept. 1899; June 1899; Mar. 1899; Mar. 1900.

"The Law of Error", Camb. Phil. Trans., Vol. XX, 1905, pp. 36-65
and 113-141.

"On the Generalized Law of Error of Great Numbers", Jour. Roy.
Stat. Soc., Sept. 1906.

"On the Representation of Statistical Frequencies by a Series", Jour.
Roy. Stat. Soc., Mar. 1907.

"On the Representation of Statistics by Analytical Geometry", Jour.
Roy. Stat. Soc., 1914; Feb., pp. 300-312; Mar., 415-432; May,
653-671; June, 724-749; July, 838-852.

Article on "Probability" in the Encyclopaedia Britannica, Eleventh
Edition.

EDGEWORTH,  F.  Y.  and  BOWLEY,  A.  L. 

"Methods  of  Representing  Statistics  of  Wages  and  other  Groups 
not  Fulfilling  the  Normal  Law  of  Error",  Jour.  Roy.  Stat.  Soc., 
June,  1902. 

ELDERTON,  W.  PALIN. 

"Frequency Curves and Correlation", C. and E. Layton, London,
1906.

ELLIS,  LESLIE. 

"The  Method  of  Least  Squares",  Camb.  Phil.  Trans.,  Vol.  VIII, 
pp.  i  et  seq. 

FISHER, R. A.

"On an Absolute Criterion for Fitting Frequency Curves",
Messenger, Vol. XLI, pp. 155-160.

GALTON,  FRANCIS. 

"Family  Likeness  in  Stature",  Proc.  Roy.  Soc.,  Vol.  XL,  1886;  pp. 
42  et  seq. 

"Family  Likeness  in  Eye-Color",  Proc.  Roy.  Soc.,  1886;  Vol.  XL, 
pp.  402  et  seq. 

GALTON,  FRANCIS. 

"Correlations  and  their  Measurement",  Proc.  Roy.  Soc.,  Vol.  XLV, 
1888,  pp.  135  et  seq. 

"The  most  Suitable  Proportions  between  First  and  Second  Prizes", 
Biometrika,  Vol.  I,  pp.  385  et  seq. 

HERON,  DAVID. 

"On  the  Probable  Error  of  a  Partial  Coefficient",  Biometrika,  Vol. 
VII,  pp.  411  et  seq. 

"The  Danger  of  Certain  Formulae  Suggested  as  Substitutes  for  the 
Correlation Coefficient", Biometrika, Vol. VIII, pp. 109 et seq.



HOOKER,  R.  H. 

"Correlation  of  the  Marriage  Rate  with  Trade",  Jour.  Roy.  Stat. 
Soc.,  Sept.  1901. 

"Correlation  of  the  Weather  and  Crops",  Jour.  Roy.  Stat.  Soc., 
Mar.  1907. 

ISSERLIS, L.

"On the Partial Correlation Ratio", Biometrika, Vol. X, pp. 391
et seq., also Vol. XI.

"The Application of Solid Hypergeometrical Series to Frequency
Distribution in Space", Phil. Mag., Vol. XXVIII, Ser. 6, 1914, pp.
379 et seq.

KEYNES,  J.  M. 

"Principal  Averages  and  the  Laws  of  Error  which  lead  to  them", 
Jour.  Roy.  Stat.  Soc.,  Feb.  1911. 

NIXON,  J.  W. 

"An  Experimental  Test  of  the  Normal  Law  of  Error",  Jour.  Roy. 
Stat.  Soc.,  June,  1913. 

PEARSON,  KARL. 

"Mathematical   Contributions   to   the  Theory  of   Evolution", 

I.  "On  the  Dissection  of  Asymmetrical  Frequency 
Curves",  Phil.  Trans.,  1894,  Vol.  CLXXXV,  A,  part 
I,  pp.  187  et  seq. 

II.  "Skew Variations in Homogeneous Material", Phil.
Trans., 1895, Vol. CLXXXVI, A, pp. 343 et seq.

III.  "Regression, Heredity and Panmixia", Phil. Trans.,
1896, Vol. CLXXXVII, A, pp. 253 et seq.

IV.  "On  Probable  Errors  of  Frequency  Constants  and  on 
the  Influence  of  Random  Selection  on  Variation  and 
Correlation", Phil. Trans., 1898, Vol. CXCI A, pp. 229
et  seq.     (In  Collaboration  with  L.  N.  G.  Filon.) 

V.     "On  the  Reconstruction  of  the  Stature  of  Prehistoric 
Races",   Phil.   Trans.,   1892,    Vol.   CXCII   A,  pp.   169 
et  seq. 
VI.     "General  Selection",  Phil.  Trans.,  1899,  Vol.  CXCII  A, 

pp.  257  et  seq. 

VII.  "On  the  Correlation  of  Characteristics  not  Quanti- 
tatively Measurable",  Phil.  Trans.,  1909,  Vol.  CXCV 
A,  pp.  I  et  seq. 

VIII.  "On  the  Inheritance  of  Characters  not  Capable  of 
Exact  Quantitative  Measurements",  Phil.  Trans.,  1901, 
Vol.  CXCV  A,  pp.  79  et  seq. 

IX.  "On  the  Principles  of  Homotyposis  and  its  Relation 
to  Heredity,  to  the  Variability  of  the  Individual  and 
to that of the Race", Phil. Trans., 1901, Vol. CXCVII
A, pp. 285 et seq.


X.  "Supplement to a Memoir on Skew Variation", Phil.
Trans., 1901, Vol. CXCVII A, pp. 445 et seq.
XI.     "On  the  Influence  of  Natural  Selection  on  the  Varia- 
bility and  Correlation  of  Organs",  Phil.   Trans.,  Vol. 
CC,  A,  1903,  pp.  i  et  seq. 

XII.  "On  a  Generalized  Theory  of  Alternative  Inheritance 
with  Special  Reference  to  Mendel's  Law",  Phil. 
Trans.,  1904,  Vol.  CCIII  A,  pp.  53  et  seq. 

XIII.  "On  the   Theory  of   Contingency  and  its   Relation  to 
Association    and    Normal    Correlation",    Drapers'    Co. 
Res.  Mem.,  Biometric  Series  I ,  Dulau  &  Co.,  London, 
1904. 

XIV.  "On   the    General    Theory    of    Skew    Correlation    and 
Non-Linear    Regression",    Drapers'    Co.    Res.    Mem., 
Dulau  &  Co.,  1905. 

XV.     "A    Mathematical    Theory    of    Random    Migration", 
Drapers'  Co.  Res.  Mem.,  Biometric  Series  II,  1906. 
(In Collaboration with Blakeman).

XVI.  "On  Further  Methods  of  Determining  Correlation", 
Drapers'  Co.  Res.  Mem.,  Biometric  Series  IV,  Dulau 
&  Co.,  London,  1907. 

XVIII.     "On  a  Novel  Method  of  Regarding  Association,  etc.", 
Biometric  Series   VII,   1912,  Drapers'  Co.  Res.  Mem. 

"On  a  Form  of  Spurious  Correlation  due  to  Indices",  Proc.  Roy. 
Soc.,  Vol.  LX,  1897,  pp.  489  et  seq. 

"On  a  Criterion  that  a  given  System  of  Deviations  from  the  Prob- 
able in  the  Case  of  Correlated  System  of  Variables  is  such  that  it 
can  be  reasonably  supposed  to  have  arisen  from  Random  Sampling", 
Phil. Mag., Ser. 5, Vol. L, 1900, pp. 157 et seq.

"On    Lines    and    Planes    of    Closest    Fit    to    Systems    of    Points    in 
Space", Phil. Mag., Ser. 6, Vol. II, 1901, pp. 559 et seq.

"On the Systematic Fitting of Curves to Observations and
Measurements", Biometrika, I, pp. 265 et seq. and Biometrika, II, pp. 1
et seq.

"On  the  Probable  Errors  of  Frequency  Constants",  Biometrika,  II, 
pp.  273  et  seq.;  also  Vol.  IX,  pp.  i  et  seq. 

"Elementary  Proof  of  Sheppard's  Formulae,  etc.",  Biometrika,  Vol. 
III, pp. 308 et seq.

"On the Generalized Probable Error in Multiple Normal
Correlation", Biometrika, Vol. VI, 1908, pp. 59 et seq. With Alice Lee.

"On a New Method of Determining Correlation between a
Measured Character A and a Character B of which only the Percentage
of Cases wherein B exceeds (or falls short of) a given Intensity
is recorded for each Grade of A", Biometrika, Vol. VII, 1909, pp.
96 et seq.


"On a New Method of Determining Correlation when one Variable
is given by Alternative and the other by Multiple Categories",
Biometrika, Vol. VII, 1910, pp. 248 et seq.

"On a Correction to be made to the Correlation ratio η",
Biometrika, Vol. VIII, pp. 254 et seq.

"On the Probable Error of a Coefficient of Correlation as found
from a fourfold Table", Biometrika, Vol. IX, pp. 22 et seq.

"On the Measurement of the Influence of 'Broad Categories' on
Correlation", Biometrika, Vol. IX, pp. 166 et seq.

PEARSON, K. (Editor).

"Tables for Statisticians and Biometricians", Cambridge University
Press, 1914.

PEARSON, K. and HERON, DAVID.

"On Theories of Association", Biometrika, Vol. IX, pp. 158 et seq.

PERSONS, WARREN.

"The Correlation of Economic Statistics", Amer. Stat. Assoc., Vol.
XII, Dec. 1910.

SHEPPARD, W. F.

"On Application of the Theory of Error to Cases of Normal
Distribution and Normal Correlation", Phil. Trans., 1899, Vol. CXCII,
A, pp. 101 et seq.

"On the Calculation of the most probable Values of the Frequency
Constants for Data arranged according to equi-distant Divisions of
a Scale", Proc. Lon. Math. Soc., Vol. XXIX, pp. 353-380.

"On the Use of Auxiliary Curves in Statistics of Continuous
Variates", Jour. Roy. Stat. Soc., Sept. 1900.

SNOW,  E.  C. 

"The  Application  of  the  Method  of  Multiple  Correlation  to  the 
Estimate  of  Post  Censal  Population",  Jour.  Roy.  Stat.  Soc.,  May, 
1911. 

SPEARMAN, C.

"The Proof and Measurement of Association between Two
Things", Amer. Jour. of Psychology, Vol. XV, 1904, pp. 88 et seq.

"Demonstration of Formulae for True Measurement of
Correlation", Amer. Jour. of Psych., Vol. XVIII, 1907, pp. 161 et seq.

"A Foot-rule for Measuring Correlation", Brit. Jour. of Psych.,
Vol. II, 1906, pp. 89 et seq.; also Vol. II, part v, pp. 107-108.

"Correlation calculated from Faulty Data", Brit. Jour. of Psych.,
Vol. III, 1910, pp. 271 et seq.

"STUDENT". 

"The  Elimination  of  Spurious  Correlation  due  to  Position  in  Time 
or Space", Biometrika, Vol. X, pp. 179 et seq.



YULE, G. U.

"On  the  Significance  of  Bravais'  Formulae  for  Regression,  etc.,  in 
the  case  of  Skew  Correlation",  Proc.  Roy.  Soc.,  Vol.  LX,  1897, 
pp.  477  et  seq. 

"On  the  Association  of  Attributes  in  Statistics",  Phil.  Trans.,  1900, 
Vol.  CXCIV,  A,  pp.  257  et  seq. 

"On  the  Theory  of  Consistence  of  Logical  Class  Frequencies  and 
its  Geometrical  Representations",  Phil.  Trans.,  1901,  Vol.  CXCVII, 
A,  pp.  91  et  seq. 

"On  the  Theory  of  Correlation  for  any  Number  of  Variables 
treated  by  a  New  System  of  Notation",  Proc.  Roy.  Soc.,  Ser.  A, 
Vol.  LXXIX,  1907,  pp.  182  et  seq. 

"The  Application  of  the  Methods  of  Correlation  to  Social  Economic 
Statistics", Jour. Roy. Stat. Soc., Dec. 1909.

"On  Interpretation  of  Correlation  between  Indices  or  Ratios",  Jour. 
Roy.  Stat.  Soc.,  June,  1910. 

"On  the  Methods  of  Measuring  Association  between  Two  Attrib- 
utes", Jour.  Roy.  Stat.  Soc.,  May,  1912. 


